neuralset.extractors.audio.MelSpectrum

pydantic model neuralset.extractors.audio.MelSpectrum

Compute the Mel spectrogram representation of an audio waveform.

This extractor computes a Mel-scaled power spectrogram from raw waveform data, converting time-domain audio into a time-frequency representation that emphasizes perceptually relevant frequency bands. The resulting tensor can optionally be log-scaled for improved numerical stability and interpretability.

Parameters:
  • n_mels (int, default=40) – Number of Mel filter banks to use when computing the Mel spectrogram.

  • n_fft (int, default=512) – Size of the FFT window used to compute the short-time Fourier transform (STFT).

  • hop_length (int or None, default=None) – Number of samples between successive frames. Defaults to n_fft // 4 if not set.

  • normalized (bool, default=True) – If True, normalize the spectrogram output.

  • use_log_scale (bool, default=True) – If True, apply a logarithmic transformation (base 10) to the Mel spectrum.

  • log_scale_eps (float, default=1e-5) – Small constant added to the Mel spectrum before taking the logarithm, to avoid numerical issues with log(0).
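
The hop-length fallback and the log scaling described above can be sketched in plain Python. This is a minimal illustration of the documented parameter semantics, not the library's implementation: the helper names (resolve_hop_length, frame_rate_hz, log_scale) and the 16 kHz sample rate are hypothetical and chosen only for the example.

```python
import math
from typing import List, Optional


def resolve_hop_length(n_fft: int = 512, hop_length: Optional[int] = None) -> int:
    """Effective hop in samples; the documented fallback is n_fft // 4."""
    return n_fft // 4 if hop_length is None else hop_length


def frame_rate_hz(sample_rate: int, hop_length: int) -> float:
    """Spectrogram frames per second: one frame every hop_length samples."""
    return sample_rate / hop_length


def log_scale(mel_power: List[float], eps: float = 1e-5) -> List[float]:
    """Base-10 log of each Mel power value, offset by eps so log(0) never occurs."""
    return [math.log10(p + eps) for p in mel_power]


hop = resolve_hop_length()            # default n_fft=512 gives 512 // 4
rate = frame_rate_hz(16_000, hop)     # 16 kHz is an assumed sample rate
scaled = log_scale([0.0, 1.0])        # a silent bin maps to about log10(eps)
```

With the defaults, a silent Mel bin (power 0) maps to roughly log10(1e-5) instead of negative infinity, which is the purpose of log_scale_eps.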

Fields:
field n_mels: int = 40
field n_fft: int = 512
field hop_length: int | None = None
field normalized: bool = True
field use_log_scale: bool = True
field log_scale_eps: float = 1e-05
requirements: ClassVar[tuple[str, ...]] = ('julius>=0.2.7', 'pillow>=9.2.0', 'julius>=0.2.7', 'pillow>=9.2.0', 'torchaudio', 'soundfile')