neuralset.extractors.audio.Wav2Vec¶
- class neuralset.extractors.audio.Wav2Vec(*, model_name: str = 'facebook/wav2vec2-large-xlsr-53', device: Literal['auto', 'cpu', 'cuda', 'accelerate'] = 'auto', layers: float | list[float] | Literal['all'] = 0.6666666666666666, cache_n_layers: int | None = None, layer_aggregation: Literal['mean', 'sum', 'group_mean'] | None = 'mean', token_aggregation: Literal['first', 'last', 'mean', 'sum', 'max'] | None = 'mean', event_types: str | tuple[str, ...] = 'Audio', aggregation: Literal['single', 'sum', 'mean', 'first', 'middle', 'last', 'cat', 'stack', 'trigger'] = 'single', allow_missing: bool = False, frequency: Literal['native'] | float = 'native', norm_audio: bool = True, infra: MapInfra = MapInfra(folder=None, cluster=None, logs='{folder}/logs/{user}/%j', job_name=None, timeout_min=25, nodes=1, tasks_per_node=1, cpus_per_task=8, gpus_per_node=1, mem_gb=None, max_pickle_size_gb=None, slurm_constraint=None, slurm_partition=None, slurm_account=None, slurm_qos=None, slurm_use_srun=False, slurm_additional_parameters=None, conda_env=None, workdir=None, permissions=511, version='v5', keep_in_ram=True, max_jobs=128, min_samples_per_job=4096, forbid_single_item_computation=False, mode='cached'), normalized: bool = True, layer_type: Literal['transformer', 'convolution'] = 'transformer')[source]¶
Extract speech embeddings using a pretrained Wav2Vec 2.0 model from Hugging Face.
The Wav2Vec 2.0 architecture learns contextualized speech representations from raw audio waveforms using self-supervised pretraining on large multilingual audio corpora, and is widely used for tasks such as automatic speech recognition (ASR), speaker verification, and speech classification.
- Parameters:
model_name (str) – The Hugging Face model identifier to load, defaulting to "facebook/wav2vec2-large-xlsr-53".
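The layers parameter accepts a single float, a list of floats, or "all" (the default 0.6666666666666666 is two thirds). One plausible reading, sketched below as a standalone hypothetical helper (resolve_layers is not part of neuralset), is that each float selects a layer by fractional depth of the transformer stack:

```python
from typing import Literal, Union


def resolve_layers(
    spec: Union[float, list[float], Literal["all"]],
    n_layers: int,
) -> list[int]:
    """Hypothetical illustration: map a layer spec to concrete layer indices.

    Assumes a float in [0, 1] names a fraction of network depth; this is an
    interpretation for illustration, not the library's documented behavior.
    """
    if spec == "all":
        return list(range(n_layers))
    fractions = [spec] if isinstance(spec, float) else spec
    # Round each fractional depth to the nearest valid zero-based layer index.
    return [min(n_layers - 1, round(f * (n_layers - 1))) for f in fractions]


# Default spec of 2/3 on a 24-layer stack (wav2vec2-large models have
# 24 transformer layers) would pick a single upper-middle layer.
print(resolve_layers(2 / 3, 24))
```

Under this reading, token_aggregation then pools the selected layer's frame-level outputs over time, and layer_aggregation combines the results when several layers are selected.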