neuralset.extractors.audio.HuggingFaceAudio
- class neuralset.extractors.audio.HuggingFaceAudio(*, model_name: str = 'facebook/wav2vec2-large-xlsr-53', device: Literal['auto', 'cpu', 'cuda', 'accelerate'] = 'auto', layers: float | list[float] | Literal['all'] = 0.6666666666666666, cache_n_layers: int | None = None, layer_aggregation: Literal['mean', 'sum', 'group_mean'] | None = 'mean', token_aggregation: Literal['first', 'last', 'mean', 'sum', 'max'] | None = 'mean', event_types: str | tuple[str, ...] = 'Audio', aggregation: Literal['single', 'sum', 'mean', 'first', 'middle', 'last', 'cat', 'stack', 'trigger'] = 'single', allow_missing: bool = False, frequency: Literal['native'] | float = 'native', norm_audio: bool = True, infra: MapInfra = MapInfra(folder=None, cluster=None, logs='{folder}/logs/{user}/%j', job_name=None, timeout_min=25, nodes=1, tasks_per_node=1, cpus_per_task=8, gpus_per_node=1, mem_gb=None, max_pickle_size_gb=None, slurm_constraint=None, slurm_partition=None, slurm_account=None, slurm_qos=None, slurm_use_srun=False, slurm_additional_parameters=None, conda_env=None, workdir=None, permissions=511, version='v5', keep_in_ram=True, max_jobs=128, min_samples_per_job=4096, forbid_single_item_computation=False, mode='cached'), normalized: bool = True, layer_type: Literal['transformer', 'convolution'] = 'transformer')[source]
Base class for extracting audio features from Hugging Face models.
This class provides a unified interface to load and process pretrained Hugging Face audio models such as Wav2Vec2, HuBERT, or XLS-R. It supports both convolutional and transformer layer outputs and handles feature extraction, model management, and layer aggregation automatically.
Some model types require special handling and should be used through their dedicated subclasses (e.g., Whisper, SeamlessM4T, or Wav2VecBert).
- Parameters:
model_name (str, default='facebook/wav2vec2-large-xlsr-53') – Name or path of the pretrained Hugging Face model to load.
normalized (bool, default=True) – Whether to normalize the input waveform before feature extraction.
layer_type ({'transformer', 'convolution'}, default='transformer') – Which internal representation to extract from the model:
- 'transformer' returns hidden states from transformer layers.
- 'convolution' returns convolutional feature maps.
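The default layers=0.666… suggests that a float selects a layer by relative depth (two-thirds of the way through the network), and token_aggregation='mean' suggests per-frame features are averaged into a single vector. The sketch below illustrates both conventions under those assumptions; resolve_layer and mean_pool are hypothetical helpers, not part of the neuralset API.

```python
import numpy as np

def resolve_layer(frac: float, n_layers: int) -> int:
    """Map a fractional depth in [0, 1] to a concrete layer index.

    Hypothetical helper: assumes a float `layers` value such as 2/3
    (the default) selects a transformer layer by relative depth.
    """
    return round(frac * (n_layers - 1))

def mean_pool(hidden: np.ndarray) -> np.ndarray:
    """Average frame-level features into one vector (token_aggregation='mean')."""
    return hidden.mean(axis=0)

# wav2vec2-large-xlsr-53 has 24 transformer layers, so a relative
# depth of 2/3 resolves to index round(2/3 * 23) = 15.
layer = resolve_layer(2 / 3, 24)

# Fake hidden states for one clip: 50 audio frames, 1024-dim features.
hidden = np.random.default_rng(0).normal(size=(50, 1024))
pooled = mean_pool(hidden)  # one 1024-dim vector per clip
```

With layers given as a list of floats, each fraction would resolve to its own index the same way, and layer_aggregation would then combine the selected layers.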