Dataset Loaders¶
classDiagram
  ABC <|-- AbstractDatasetLoader
  ABC <|-- AssetDownloadManager
  ABC <|-- AssetStore
  AssetError <|-- AssetCardError
  BaseException <|-- Exception
  DatasetLoader <|-- AbstractDatasetLoader
  DatasetLoader <|-- DelegatingDatasetLoader
  Exception <|-- AssetError
  Generic <|-- Protocol
  Protocol <|-- DatasetLoader
  PurePath <|-- Path
The dataset loader system in fairseq2 loads datasets of different formats through a common, extensible interface. Formats are organized into dataset families, and the family of a dataset determines which loader handles it.
Dataset Family¶
A dataset family represents a specific format or structure of data that requires specialized loading logic.
Each dataset is associated with a family through the dataset_family field in its asset card.
Built-in Dataset Families¶
fairseq2 includes several built-in dataset families:
- generic_text: For plain text datasets
- generic_parallel_text: For parallel text/translation datasets
- generic_asr: For automatic speech recognition datasets
- generic_speech: For speech-only datasets
- generic_instruction: For instruction-tuning datasets
- generic_preference_optimization: For preference optimization datasets
Example Asset Card¶
name: librispeech_asr
dataset_family: generic_asr
tokenizer: "https://example.com/tokenizer.model"
tokenizer_family: char_tokenizer
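To make the role of the dataset_family field concrete, the following sketch reads a flat card like the one above into a dictionary and extracts the family name. The parse_flat_card helper is hypothetical and only handles single-level `key: value` lines; it is an illustration, not how fairseq2 itself parses asset cards.

```python
CARD_TEXT = '''\
name: librispeech_asr
dataset_family: generic_asr
tokenizer: "https://example.com/tokenizer.model"
tokenizer_family: char_tokenizer
'''


def parse_flat_card(text: str) -> dict[str, str]:
    """Parse single-level `key: value` lines into a dict (hypothetical helper)."""
    card: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a `key: value` line: {line!r}")
        # Drop surrounding whitespace and optional double quotes.
        card[key.strip()] = value.strip().strip('"')
    return card


card = parse_flat_card(CARD_TEXT)
family = card["dataset_family"]  # the value a loader would dispatch on
print(family)
```

A real card may contain nested fields, so a proper YAML parser would be the safer choice outside of a sketch like this.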
Core Components¶
DatasetLoader Protocol¶
AbstractDatasetLoader¶
- class fairseq2.datasets.loader.AbstractDatasetLoader(*, asset_store=None, download_manager=None)
  Bases: ABC, DatasetLoader[DatasetT]
  Provides a skeletal implementation of DatasetLoader.
  - Parameters:
    asset_store (AssetStore | None) – The asset store in which to look up available datasets. If None, the default asset store will be used.
    download_manager (AssetDownloadManager | None) – The download manager. If None, the default download manager will be used.
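A skeletal implementation usually means that the base class handles the shared steps and leaves a single hook to subclasses. The sketch below illustrates that pattern in plain Python under stated assumptions: SketchLoader, its resolve_path callable, and PathRecordingLoader are hypothetical stand-ins, and the only detail taken from this page is the `_load(path, card)` hook that appears in the usage examples. It should not be read as fairseq2's actual code.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Generic, TypeVar

DatasetT = TypeVar("DatasetT")


class SketchLoader(ABC, Generic[DatasetT]):
    """Hypothetical base class: shared steps here, format-specific work in `_load`."""

    def __init__(self, resolve_path: Callable[[dict], Path]) -> None:
        # Stand-in for the asset store / download manager pair.
        self._resolve_path = resolve_path

    def __call__(self, card: dict) -> DatasetT:
        path = self._resolve_path(card)  # shared step: locate the data
        return self._load(path, card)    # subclass hook: build the dataset

    @abstractmethod
    def _load(self, path: Path, card: dict) -> DatasetT:
        ...


class PathRecordingLoader(SketchLoader[tuple]):
    """Toy subclass that returns the card name and resolved path."""

    def _load(self, path: Path, card: dict) -> tuple:
        return (card["name"], path)


load = PathRecordingLoader(lambda card: Path("/data") / card["name"])
result = load({"name": "my_text_dataset", "dataset_family": "generic_text"})
```

With this split, a subclass only has to know how to turn a local path into a dataset object; lookup and download concerns stay in the base class.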
DelegatingDatasetLoader¶
- final class fairseq2.datasets.loader.DelegatingDatasetLoader(*, asset_store=None)
  Bases: DatasetLoader[DatasetT]
  Loads datasets of type DatasetT using registered loaders.
  - Parameters:
    asset_store (AssetStore | None) – The asset store in which to look up available datasets. If None, the default asset store will be used.
  - register(family, loader)
    Register a dataset loader to use with this loader.
    - Parameters:
      family (str) – The dataset type. If the dataset_family field of an asset card matches this value, the specified loader will be used.
      loader (DatasetLoader[DatasetT]) – The dataset loader.
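The dispatch behavior described for register can be sketched as a dictionary lookup keyed by family name. The class below is a hypothetical, dependency-free illustration: SketchDelegatingLoader, the dict-based card, and the choice to reject duplicate registrations are all assumptions rather than documented fairseq2 behavior.

```python
from typing import Callable, Dict


class SketchDelegatingLoader:
    """Hypothetical dispatcher keyed by the card's `dataset_family` value."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[dict], object]] = {}

    def register(self, family: str, loader: Callable[[dict], object]) -> None:
        # Assumption: registering the same family twice is treated as an error.
        if family in self._loaders:
            raise ValueError(f"a loader for family '{family}' is already registered")
        self._loaders[family] = loader

    def __call__(self, card: dict) -> object:
        family = card["dataset_family"]
        try:
            loader = self._loaders[family]
        except KeyError:
            raise LookupError(f"no loader registered for family '{family}'") from None
        # Delegate the actual loading to the family-specific loader.
        return loader(card)


load_dataset = SketchDelegatingLoader()
load_dataset.register("generic_text", lambda card: f"text dataset {card['name']}")
loaded = load_dataset({"name": "my_text_dataset", "dataset_family": "generic_text"})
```

Keeping the registry separate from the loaders means a new dataset format can be supported by registering one more family, without touching existing loaders.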
Utility Functions¶
Usage Examples¶
1. Loading a Dataset Using Family¶
from fairseq2.assets import AssetCard
from fairseq2.datasets import load_text_dataset
# Load using dataset name (will look up asset card)
dataset = load_text_dataset("my_text_dataset")
# Load using explicit asset card
card = AssetCard(name="custom_dataset", dataset_family="generic_text")
dataset = load_text_dataset(card)
2. Registering a Custom Dataset Loader¶
from pathlib import Path
from fairseq2.assets import AssetCard
from fairseq2.datasets import DelegatingDatasetLoader
from fairseq2.datasets.loader import AbstractDatasetLoader
# Create your custom dataset loader
# (MyDataset is a placeholder for your own dataset class)
class MyCustomDatasetLoader(AbstractDatasetLoader[MyDataset]):
    def _load(self, path: Path, card: AssetCard) -> MyDataset:
        return MyDataset.from_path(path)
# Register with a family name
loader = MyCustomDatasetLoader()
load_dataset = DelegatingDatasetLoader()
load_dataset.register("my_custom_family", loader)