neuralset.extractors.audio.SonarAudio

class neuralset.extractors.audio.SonarAudio(*, event_types: str | tuple[str, ...] = 'Audio', aggregation: Literal['single', 'sum', 'mean', 'first', 'middle', 'last', 'cat', 'stack', 'trigger'] = 'single', allow_missing: bool = False, frequency: Literal['native'] | float = 'native', norm_audio: bool = True, infra: MapInfra = MapInfra(folder=None, cluster=None, logs='{folder}/logs/{user}/%j', job_name=None, timeout_min=25, nodes=1, tasks_per_node=1, cpus_per_task=8, gpus_per_node=1, mem_gb=None, max_pickle_size_gb=None, slurm_constraint=None, slurm_partition=None, slurm_account=None, slurm_qos=None, slurm_use_srun=False, slurm_additional_parameters=None, conda_env=None, workdir=None, permissions=511, version='v5', keep_in_ram=True, max_jobs=128, min_samples_per_job=4096, forbid_single_item_computation=False, mode='cached'), sampling_rate: int = 16000, layer: float = 0.5)[source]

Extract deep audio embeddings from waveforms using the Sonar speech encoder.

SONAR stands for Sentence-level multimOdal and laNguage-Agnostic Representations.

This extractor uses the sonar_speech_encoder_eng model to produce sentence-level speech embeddings.
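Waveforms must match the sampling rate the model expects (see the sampling_rate parameter below). As a purely illustrative, standard-library sketch, this is what bringing audio to the target rate with naive linear interpolation looks like; in practice a proper resampler (e.g. a polyphase filter) should be used, and this helper is not part of the neuralset API:

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample a waveform with linear interpolation (illustration only).

    Hypothetical helper, not part of neuralset; real pipelines should use a
    band-limited resampler to avoid aliasing.
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        # Position of output sample i on the source time axis.
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```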

Parameters:
  • sampling_rate (int, default=16_000) – The input sampling rate, in Hz, expected by the Sonar model.

  • layer (float, default=0.5) – The relative depth of the encoder layer from which to extract the embedding (0 = first layer, 1 = last layer).
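One natural reading of the layer parameter is that the fraction is mapped onto a concrete layer index of the encoder. The helper below is a hypothetical sketch of such a mapping (the function name, and the exact rounding SonarAudio uses internally, are assumptions, not the documented implementation):

```python
def layer_to_index(layer: float, num_layers: int) -> int:
    """Map a relative layer in [0, 1] to an encoder layer index.

    Hypothetical helper: 0.0 selects the first layer, 1.0 the last,
    intermediate fractions round to the nearest layer. The actual
    mapping inside SonarAudio may differ.
    """
    if not 0.0 <= layer <= 1.0:
        raise ValueError("layer must be in [0, 1]")
    return round(layer * (num_layers - 1))
```

For a 24-layer encoder, layer=0.5 would then select a middle layer rather than either extreme, which is why fractional depths are useful for probing intermediate representations.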