Dataset Loaders

classDiagram
  ABC <|-- AbstractDatasetLoader
  ABC <|-- AssetDownloadManager
  ABC <|-- AssetStore
  AssetError <|-- AssetCardError
  BaseException <|-- Exception
  DatasetLoader <|-- AbstractDatasetLoader
  DatasetLoader <|-- DelegatingDatasetLoader
  Exception <|-- AssetError
  Generic <|-- Protocol
  Protocol <|-- DatasetLoader
  PurePath <|-- Path

The dataset loaders in fairseq2 resolve a dataset name or asset card to a dataset object. Loaders are organized by dataset family: each family identifies a data format, and a loader registered for that family handles every dataset that declares it.

Dataset Family

A dataset family represents a specific format or structure of data that requires specialized loading logic. Each dataset is associated with a family through the dataset_family field in its asset card.

Built-in Dataset Families

fairseq2 includes several built-in dataset families:

  • generic_text: For plain text datasets

  • generic_parallel_text: For parallel text/translation datasets

  • generic_asr: For automatic speech recognition datasets

  • generic_speech: For speech-only datasets

  • generic_instruction: For instruction-tuning datasets

  • generic_preference_optimization: For preference optimization datasets

Example Asset Card

name: librispeech_asr
dataset_family: generic_asr
tokenizer: "https://example.com/tokenizer.model"
tokenizer_family: char_tokenizer

Core Components

DatasetLoader Protocol

class fairseq2.datasets.loader.DatasetLoader(*args, **kwargs)

Bases: Protocol[DatasetT_co]

Loads datasets of type DatasetT.

__call__(dataset_name_or_card, *, force=False, progress=True)
Parameters:
  • dataset_name_or_card (str | AssetCard) – The name or the asset card of the dataset to load.

  • force (bool) – If True, downloads the dataset even if it is already in cache.

  • progress (bool) – If True, displays a progress bar to stderr.

Return type:

DatasetT_co
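
Because DatasetLoader is a protocol, any callable with a matching signature can act as a loader without inheriting from it. The sketch below illustrates that idea with the standard library only. LoaderProtocol, InMemoryTextLoader, and count_examples are hypothetical names that are not part of fairseq2, and the protocol here accepts only a string name rather than an asset card.

```python
from typing import Dict, List, Protocol, TypeVar

DatasetT_co = TypeVar("DatasetT_co", covariant=True)


class LoaderProtocol(Protocol[DatasetT_co]):
    """Hypothetical stand-in for the protocol described above."""

    def __call__(
        self, dataset_name_or_card: str, *, force: bool = False, progress: bool = True
    ) -> DatasetT_co: ...


class InMemoryTextLoader:
    """Matches LoaderProtocol structurally; it does not inherit from it."""

    def __init__(self, corpora: Dict[str, List[str]]) -> None:
        self._corpora = corpora

    def __call__(
        self, dataset_name_or_card: str, *, force: bool = False, progress: bool = True
    ) -> List[str]:
        # A real loader would resolve an asset card and fetch files here;
        # this toy version ignores `force` and `progress`.
        return self._corpora[dataset_name_or_card]


def count_examples(loader: LoaderProtocol[List[str]], name: str) -> int:
    # Accepts anything that satisfies the protocol.
    return len(loader(name))


toy_loader = InMemoryTextLoader({"toy": ["hello", "world"]})
num_examples = count_examples(toy_loader, "toy")
```

A static type checker accepts toy_loader as a LoaderProtocol[List[str]] because the call signature matches, which is the property that lets plain functions and classes share one loader interface.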

AbstractDatasetLoader

class fairseq2.datasets.loader.AbstractDatasetLoader(*, asset_store=None, download_manager=None)

Bases: ABC, DatasetLoader[DatasetT]

Provides a skeletal implementation of DatasetLoader.

Parameters:
  • asset_store (AssetStore | None) – The asset store in which to look up available datasets. If None, the default asset store will be used.

  • download_manager (AssetDownloadManager | None) – The download manager. If None, the default download manager will be used.
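
A skeletal implementation of this kind typically follows the template-method pattern: the base class handles lookup, and subclasses supply only the format-specific step. The following is a minimal sketch of that pattern under stated assumptions, not fairseq2's actual implementation. SkeletalLoader and SuffixLoader are hypothetical names, and a plain dictionary of paths stands in for the asset store and download manager.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Generic, TypeVar

DatasetT = TypeVar("DatasetT")


class SkeletalLoader(ABC, Generic[DatasetT]):
    """Hypothetical skeleton: resolves a path, then defers to _load()."""

    def __init__(self, locations: Dict[str, Path]) -> None:
        # Assumption: a name-to-path mapping replaces the asset store
        # and the download manager.
        self._locations = locations

    def __call__(
        self, name: str, *, force: bool = False, progress: bool = True
    ) -> DatasetT:
        # `force` and `progress` are accepted for signature parity only.
        try:
            path = self._locations[name]
        except KeyError:
            raise LookupError(f"'{name}' is not a known dataset.") from None
        return self._load(path, name)

    @abstractmethod
    def _load(self, path: Path, name: str) -> DatasetT:
        """Build the dataset from the resolved path."""


class SuffixLoader(SkeletalLoader[str]):
    def _load(self, path: Path, name: str) -> str:
        # Toy "dataset": a label built from the name and file suffix.
        return f"{name}:{path.suffix}"


loader = SuffixLoader({"toy": Path("/data/toy.tsv")})
summary = loader("toy")
```

Subclasses never repeat the lookup or error handling; they override only _load.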

DelegatingDatasetLoader

final class fairseq2.datasets.loader.DelegatingDatasetLoader(*, asset_store=None)

Bases: DatasetLoader[DatasetT]

Loads datasets of type DatasetT using registered loaders.

Parameters:
  • asset_store (AssetStore | None) – The asset store in which to look up available datasets. If None, the default asset store will be used.

register(family, loader)

Register a dataset loader to use with this loader.

Parameters:
  • family (str) – The dataset family name. If the ‘dataset_family’ field of an asset card matches this value, the specified loader will be used.

  • loader (DatasetLoader[DatasetT]) – The dataset loader.

supports(dataset_name_or_card)

Return True if the specified dataset has a registered loader.

Return type:

bool
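
The register and supports methods described above amount to a family-keyed dispatch table. The sketch below shows that mechanism in isolation. FamilyDispatcher is a hypothetical class, asset cards are represented as plain dictionaries, and error handling is an assumption rather than a description of fairseq2's behavior.

```python
from typing import Callable, Dict

Card = Dict[str, str]
Loader = Callable[[Card], object]


class FamilyDispatcher:
    """Hypothetical dispatcher keyed on a card's 'dataset_family' field."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Loader] = {}

    def register(self, family: str, loader: Loader) -> None:
        # Assumption: a later registration silently replaces an earlier one.
        self._loaders[family] = loader

    def supports(self, card: Card) -> bool:
        return card.get("dataset_family") in self._loaders

    def __call__(self, card: Card) -> object:
        family = card.get("dataset_family")
        if family not in self._loaders:
            raise LookupError(f"No loader is registered for family {family!r}.")
        # Delegate to the loader registered for this family.
        return self._loaders[family](card)


dispatcher = FamilyDispatcher()
dispatcher.register("generic_text", lambda card: f"text:{card['name']}")

text_card = {"name": "toy", "dataset_family": "generic_text"}
asr_card = {"name": "toy_asr", "dataset_family": "generic_asr"}

loaded = dispatcher(text_card)
```

Calling supports first lets a caller check for a registered loader without triggering the lookup error.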

Utility Functions

fairseq2.datasets.loader.is_dataset_card(card)

Return True if card specifies a dataset.

Return type:

bool

fairseq2.datasets.loader.get_dataset_family(card)

Return the dataset family name contained in card.

Return type:

str
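
To make the behavior of these two helpers concrete, here is a rough equivalent that operates on a plain mapping instead of an AssetCard. The function names looks_like_dataset_card and read_dataset_family are hypothetical, and the assumption that a dataset card is identified by the presence of a dataset_family field is inferred from the descriptions above rather than from the library source.

```python
from typing import Any, Mapping


def looks_like_dataset_card(card: Mapping[str, Any]) -> bool:
    # Assumption: a card describes a dataset if it has a 'dataset_family' field.
    return "dataset_family" in card


def read_dataset_family(card: Mapping[str, Any]) -> str:
    family = card.get("dataset_family")
    if not isinstance(family, str):
        raise ValueError("The card has no valid 'dataset_family' field.")
    return family


# Mirrors the example asset card shown earlier, as a dictionary.
asr_card = {
    "name": "librispeech_asr",
    "dataset_family": "generic_asr",
    "tokenizer": "https://example.com/tokenizer.model",
    "tokenizer_family": "char_tokenizer",
}
tokenizer_only_card = {"name": "some_tokenizer", "tokenizer_family": "char_tokenizer"}

family = read_dataset_family(asr_card)
```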

Usage Examples

1. Loading a Dataset Using Family

from fairseq2.assets import AssetCard
from fairseq2.datasets import load_text_dataset

# Load by dataset name; the asset card is looked up in the asset store.
dataset = load_text_dataset("my_text_dataset")

# Load from an explicit asset card.
card = AssetCard(name="custom_dataset", dataset_family="generic_text")
dataset = load_text_dataset(card)

2. Registering a Custom Dataset Loader

from pathlib import Path

from fairseq2.assets import AssetCard
from fairseq2.datasets.loader import AbstractDatasetLoader, DelegatingDatasetLoader

# MyDataset is a placeholder for your own dataset class.
class MyCustomDatasetLoader(AbstractDatasetLoader[MyDataset]):
    def _load(self, path: Path, card: AssetCard) -> MyDataset:
        # Build the dataset from the files located at `path`.
        return MyDataset.from_path(path)

# Register the loader under a family name.
loader = MyCustomDatasetLoader()
load_dataset = DelegatingDatasetLoader()
load_dataset.register("my_custom_family", loader)

See Also