
AudioCraft conditioning modules

AudioCraft provides a modular implementation of conditioning modules that the language model uses to condition the generation. The codebase is designed so that the set of supported modules can be extended easily, which makes it straightforward to develop new ways of controlling the generation.

Conditioning methods

For now, AudioCraft supports 3 main types of conditioning: text-based conditioning, waveform-based conditioning, and joint embedding conditioning for text and audio.

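The language model relies on 2 core components to process the conditioning information: the ConditionProvider, which prepares the conditions from the conditioning attributes using the defined conditioners, and the ConditionFuser, which combines the processed conditions with the language model inputs.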

Different conditioners (for text, waveform, joint embeddings…) are provided as torch modules in AudioCraft and are used internally in the language model to process the conditioning signals and feed them to the language model.

Core concepts

Conditioners

The BaseConditioner torch module is the base implementation for all conditioners in AudioCraft.

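Each conditioner is expected to implement 2 methods: tokenize, a preprocessing step that gathers all the processing that can lead to synchronization points (for example tokenization followed by a transfer to the GPU), and forward, which takes the output of tokenize and computes the conditioning embedding along with a mask indicating the valid positions.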
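
The two-method pattern can be sketched without torch. This is a toy illustration, not AudioCraft's BaseConditioner: the class ToyTextConditioner, its vocabulary, and its constant "embedding table" are hypothetical, and the method names tokenize and forward are assumed to mirror a preprocessing step followed by an embedding step that also returns a validity mask.

```python
from typing import List, Optional, Tuple


class ToyTextConditioner:
    """Toy stand-in for a conditioner (hypothetical, torch-free)."""

    def __init__(self, vocab: List[str], dim: int = 4):
        # Id 0 is reserved for padding and unknown words.
        self.token_ids = {word: i + 1 for i, word in enumerate(vocab)}
        # Deterministic "embedding table": one vector of size `dim` per id.
        self.table = [[float(i)] * dim for i in range(len(vocab) + 1)]

    def tokenize(self, texts: List[Optional[str]]) -> List[List[int]]:
        """Preprocessing step: map each text to ids, padded to a common length."""
        ids = [[self.token_ids.get(w, 0) for w in (t or "").split()] for t in texts]
        max_len = max((len(seq) for seq in ids), default=0)
        return [seq + [0] * (max_len - len(seq)) for seq in ids]

    def forward(self, ids: List[List[int]]) -> Tuple[List[List[List[float]]], List[List[int]]]:
        """Core step: look up embeddings and build a mask (1 = valid, 0 = padding)."""
        embeddings = [[self.table[i] for i in seq] for seq in ids]
        mask = [[int(i != 0) for i in seq] for seq in ids]
        return embeddings, mask


conditioner = ToyTextConditioner(["rock", "calm", "piano"])
tokens = conditioner.tokenize(["calm piano", "rock", None])
embeddings, mask = conditioner.forward(tokens)
```

Keeping the preprocessing in tokenize means the forward step only sees ready-to-use inputs, which is the separation the two-method design aims for.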

ConditionProvider

The ConditionProvider prepares and provides conditions given a dictionary of conditioners.

Conditioners are specified as a dictionary that maps each attribute to the conditioner providing the processing logic for that attribute.

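Similarly to the conditioners, the condition provider works in two steps to avoid synchronization points: a tokenize step that takes the conditioning attributes for the batch and runs the tokenize method of every conditioner, and a forward step that takes the output of the tokenize step and runs the forward method of every conditioner.

The conditioning attributes for a batch are passed as a list of ConditioningAttributes, described below.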
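
As a rough illustration of how a provider can dispatch over a dictionary of conditioners in two steps, here is a minimal torch-free sketch. ToyConditionProvider and WordLengthConditioner are hypothetical names, and a batch item is simplified to a plain dictionary of optional strings rather than a real ConditioningAttributes object.

```python
from typing import Dict, List, Optional, Tuple

Attributes = Dict[str, Optional[str]]  # one item: attribute name -> raw value


class WordLengthConditioner:
    """Hypothetical conditioner that embeds each word as its length."""

    def tokenize(self, values: List[Optional[str]]) -> List[List[str]]:
        return [(value or "").split() for value in values]

    def forward(self, tokens: List[List[str]]) -> Tuple[List[List[float]], List[List[int]]]:
        embeddings = [[float(len(word)) for word in seq] for seq in tokens]
        mask = [[1] * len(seq) for seq in tokens]
        return embeddings, mask


class ToyConditionProvider:
    """Runs every conditioner on the attribute it is registered for."""

    def __init__(self, conditioners: Dict[str, WordLengthConditioner]):
        self.conditioners = conditioners

    def tokenize(self, batch: List[Attributes]) -> Dict[str, List[List[str]]]:
        # Step 1: gather each attribute across the batch, run all tokenize steps.
        return {
            name: conditioner.tokenize([item.get(name) for item in batch])
            for name, conditioner in self.conditioners.items()
        }

    def forward(self, tokenized: Dict[str, List[List[str]]]):
        # Step 2: run all forward steps on the tokenized output.
        return {
            name: self.conditioners[name].forward(tokens)
            for name, tokens in tokenized.items()
        }


provider = ToyConditionProvider(
    {"description": WordLengthConditioner(), "genre": WordLengthConditioner()}
)
batch = [
    {"description": "slow jazz", "genre": "jazz"},
    {"description": "fast", "genre": None},
]
conditions = provider.forward(provider.tokenize(batch))
```

The result maps each attribute name to an (embedding, mask) pair, which is the form a fuser would then consume.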

ConditionFuser

Once all conditioning signals have been extracted and processed by the ConditionProvider as dense embeddings, they still need to be passed to the language model along with the original language model inputs.

The ConditionFuser specifically handles the logic for combining the different conditions with the actual model input, and supports several strategies for doing so.

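One can therefore define different strategies to combine or fuse the conditions with the input, in particular: prepending the conditioning signal to the input (the prepend strategy), summing the conditioning signal with the input (the sum strategy), combining them through a cross-attention mechanism (the cross strategy), and using input interpolation (the input_interpolate strategy).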
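
Two of the simplest ways to fuse a condition with the model input, prepending and summing, can be sketched on plain lists. The functions fuse_prepend and fuse_sum are hypothetical helpers written for this illustration. The broadcasting rule in fuse_sum (a single condition step is added to every time step) is an assumption of the sketch, not a statement about the ConditionFuser internals.

```python
from typing import List

Sequence = List[List[float]]  # [time steps][embedding dim]


def fuse_prepend(model_input: Sequence, condition: Sequence) -> Sequence:
    """Place the condition steps in front of the input along the time axis."""
    return condition + model_input


def fuse_sum(model_input: Sequence, condition: Sequence) -> Sequence:
    """Add the condition to the input, broadcasting a single step over time."""
    if len(condition) not in (1, len(model_input)):
        raise ValueError("condition must have 1 step or as many steps as the input")
    fused = []
    for t, step in enumerate(model_input):
        cond_step = condition[0] if len(condition) == 1 else condition[t]
        fused.append([x + c for x, c in zip(step, cond_step)])
    return fused


model_input = [[1.0, 1.0], [2.0, 2.0]]
prepended = fuse_prepend(model_input, [[9.0, 9.0]])
summed = fuse_sum(model_input, [[0.5, 0.5]])
```

Prepending lengthens the sequence the model sees, while summing keeps its length unchanged, which is one practical reason to choose between them.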

SegmentWithAttributes and ConditioningAttributes: From metadata to conditions

The ConditioningAttributes dataclass is the base class for metadata containing all attributes used for conditioning the language model.

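It currently supports the following types of attributes: text attributes, a dictionary of textual attributes used for text conditioning; wav attributes, a dictionary of waveform attributes used for waveform-based conditioning such as the chroma conditioning; and joint embedding attributes, a dictionary of text and waveform attributes that are expected to be represented in a shared latent space.

These are the attributes processed by the different conditioners.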

ConditioningAttributes are extracted from the metadata loaded alongside the audio in the datasets, provided that the metadata used by the dataset implements the SegmentWithAttributes abstraction.

All metadata-enabled datasets used for conditioning in AudioCraft inherit from the audiocraft.data.info_dataset.InfoAudioDataset class, and the corresponding metadata class inherits from and implements the SegmentWithAttributes abstraction. See the audiocraft.data.music_dataset.MusicAudioDataset class for an example.
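
The path from segment metadata to conditioning attributes can be sketched with two small dataclasses. Both classes below are simplified, hypothetical stand-ins: the field names (text, wav, description, genre, melody) and the method name to_condition_attributes are assumptions made for this illustration, and the joint embedding group is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyConditioningAttributes:
    """Groups conditioning attributes by type (simplified stand-in)."""
    text: Dict[str, Optional[str]] = field(default_factory=dict)
    wav: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class ToyMusicSegment:
    """Hypothetical segment metadata loaded alongside an audio file."""
    path: str
    description: Optional[str] = None
    genre: Optional[str] = None
    melody: Optional[List[float]] = None  # raw waveform samples, if any

    def to_condition_attributes(self) -> ToyConditioningAttributes:
        # Route each metadata field to the attribute group its conditioner expects.
        attributes = ToyConditioningAttributes()
        attributes.text = {"description": self.description, "genre": self.genre}
        if self.melody is not None:
            attributes.wav["melody"] = self.melody
        return attributes


segment = ToyMusicSegment(path="track.wav", description="calm piano", melody=[0.0, 0.1])
attributes = segment.to_condition_attributes()
```

Missing text fields are kept as None in this sketch so that every item in a batch exposes the same attribute names.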

Available conditioners

Text conditioners

All text conditioners are expected to inherit from the TextConditioner class.

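AudioCraft currently provides two text conditioners: the LUTConditioner, which relies on a look-up table of embeddings learned at train time, and the T5Conditioner, which relies on a pre-trained T5 model to extract the text embeddings.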

Waveform conditioners

All waveform conditioners are expected to inherit from the WaveformConditioner class and consist of a conditioning method that takes a waveform as input. The waveform conditioner must implement the logic to extract the embedding from the waveform and define the downsampling factor from the waveform to the resulting embedding.
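
To make the role of the downsampling factor concrete, here is a toy waveform conditioner that turns every block of hop_length samples into one embedding value. ToyEnergyConditioner, its embed method, and the downsampling_factor property are hypothetical names for this sketch, and rounding the number of frames up for a partial final block is an assumption.

```python
import math
from typing import List


class ToyEnergyConditioner:
    """Hypothetical waveform conditioner: one mean-magnitude value per frame."""

    def __init__(self, hop_length: int):
        if hop_length <= 0:
            raise ValueError("hop_length must be positive")
        self.hop_length = hop_length

    @property
    def downsampling_factor(self) -> int:
        # Number of waveform samples summarized by one embedding step.
        return self.hop_length

    def embed(self, wav: List[float]) -> List[float]:
        frames = [wav[i:i + self.hop_length] for i in range(0, len(wav), self.hop_length)]
        return [sum(abs(x) for x in frame) / len(frame) for frame in frames]


def expected_length(num_samples: int, downsampling_factor: int) -> int:
    """Embedding length implied by the downsampling factor (partial frame kept)."""
    return math.ceil(num_samples / downsampling_factor)


conditioner = ToyEnergyConditioner(hop_length=4)
wav = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5]
embedding = conditioner.embed(wav)
```

Knowing the factor lets downstream code relate a length in samples to a length in embedding steps, for example when building a mask for padded waveforms.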

The ChromaStemConditioner is a waveform conditioner for the chroma feature conditioning used by MusicGen. It takes a waveform, uses a pre-trained Demucs model to extract the stems relevant to the melody (namely all stems other than drums and bass), and then extracts the chromagram bins from the remaining mix of stems.
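
For readers unfamiliar with chromagrams, the idea of a chroma bin can be sketched in a few lines: spectral energy is folded onto the 12 pitch classes regardless of octave. This sketch skips stem separation entirely and starts from a hypothetical list of (frequency, magnitude) peaks; the helper names and the A4 = 440 Hz tuning are assumptions, and this is not how ChromaStemConditioner computes its features.

```python
import math
from typing import List, Tuple

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(frequency_hz: float, a4_hz: float = 440.0) -> int:
    """Index of the nearest equal-tempered pitch class (0 = C, 9 = A)."""
    midi_note = 69 + 12 * math.log2(frequency_hz / a4_hz)
    return int(round(midi_note)) % 12


def toy_chromagram(peaks: List[Tuple[float, float]]) -> List[float]:
    """Fold (frequency, magnitude) peaks into 12 normalized chroma bins."""
    chroma = [0.0] * 12
    for frequency_hz, magnitude in peaks:
        chroma[pitch_class(frequency_hz)] += magnitude
    total = sum(chroma)
    return [value / total for value in chroma] if total > 0 else chroma


# Two A notes an octave apart land in the same bin as each other.
chroma = toy_chromagram([(440.0, 1.0), (880.0, 1.0), (261.63, 2.0)])
strongest = [PITCH_CLASSES[i] for i, value in enumerate(chroma) if value > 0]
```

Because octaves collapse into the same bin, the representation captures melodic and harmonic content while discarding register, which is why removing drums and bass first gives a cleaner melody signal.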

Joint embeddings conditioners

Finally, AudioCraft supports conditioning on joint text and audio embeddings through the JointEmbeddingConditioner class and the CLAPEmbeddingConditioner, which implements this conditioning method using a pretrained CLAP model.

Classifier Free Guidance

We provide a Classifier Free Guidance implementation in AudioCraft. With the classifier free guidance dropout, all attributes are dropped with the same probability.
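
The dropout behind classifier free guidance can be illustrated as follows. The function cfg_dropout is a hypothetical helper, a batch item is simplified to a dictionary of optional strings, and making a single random draw for the whole batch is an assumption of this sketch. The key property is that all attributes are dropped together.

```python
import random
from typing import Dict, List, Optional

Attributes = Dict[str, Optional[str]]


def cfg_dropout(batch: List[Attributes], p: float, rng: random.Random) -> List[Attributes]:
    """With probability p, replace every attribute of every item with None."""
    if rng.random() >= p:
        return batch  # keep all conditions untouched
    # Drop everything at once so the model also learns an unconditional mode.
    return [{name: None for name in item} for item in batch]


rng = random.Random(0)
batch = [{"description": "calm piano", "genre": "classical"}]
always_dropped = cfg_dropout(batch, p=1.0, rng=rng)
never_dropped = cfg_dropout(batch, p=0.0, rng=rng)
```

Training with some fully unconditional examples is what later allows conditional and unconditional predictions to be contrasted at generation time.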

Attribute Dropout

We further provide an attribute dropout strategy. Unlike the classifier free guidance dropout, the attribute dropout drops each given attribute with its own defined probability, so that the model does not come to expect all conditioning signals to be provided at once.
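
Dropping attributes independently can be sketched like this. The function attribute_dropout and the shape of its probability dictionary are hypothetical, chosen only to show that each attribute has its own probability and its own random draw.

```python
import random
from typing import Dict, List, Optional

Attributes = Dict[str, Optional[str]]


def attribute_dropout(
    batch: List[Attributes], p: Dict[str, float], rng: random.Random
) -> List[Attributes]:
    """Drop each attribute independently with its own probability."""
    dropped = []
    for item in batch:
        dropped.append({
            # Attributes without a configured probability are never dropped.
            name: None if rng.random() < p.get(name, 0.0) else value
            for name, value in item.items()
        })
    return dropped


rng = random.Random(0)
batch = [{"description": "calm piano", "genre": "classical"}]
result = attribute_dropout(batch, {"genre": 1.0, "description": 0.0}, rng)
```

With intermediate probabilities the model sees every combination of present and missing attributes during training, so any subset can be supplied at generation time.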

Faster computation of conditions

Conditioners that require heavy computation on the waveform, in particular the ChromaStemConditioner and the CLAPEmbeddingConditioner, can cache their outputs. To enable caching, pass them the cache_path parameter. We recommend running dummy jobs to fill the cache quickly; an example is provided in the musicgen.musicgen_melody_32khz grid.
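
The general idea of caching an expensive waveform computation on disk can be sketched with the standard library. DiskCache is a hypothetical class and does not reflect how AudioCraft implements its cache. It only shows that a cache directory lets a second request for the same item skip the computation.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List


class DiskCache:
    """Hypothetical on-disk cache for per-item embeddings."""

    def __init__(self, cache_path: str, compute: Callable[[List[float]], List[float]]):
        self.cache_path = Path(cache_path)
        self.cache_path.mkdir(parents=True, exist_ok=True)
        self.compute = compute
        self.misses = 0  # number of times the expensive function actually ran

    def get(self, key: str, wav: List[float]) -> List[float]:
        # One file per item, named after a hash of its identifying key.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        cache_file = self.cache_path / f"{digest}.pkl"
        if cache_file.exists():
            with cache_file.open("rb") as f:
                return pickle.load(f)
        self.misses += 1
        value = self.compute(wav)
        with cache_file.open("wb") as f:
            pickle.dump(value, f)
        return value


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(tmp, compute=lambda wav: [abs(x) for x in wav])
    first = cache.get("song.wav", [-1.0, 0.5])
    second = cache.get("song.wav", [-1.0, 0.5])  # served from disk
    misses = cache.misses
```

A pass over the dataset that only touches the cache, such as the dummy jobs mentioned above, pays the computation cost once so that later training runs read precomputed values.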