fairseq2.datasets¶
The datasets module provides flexibility in creating and managing datasets for various tasks. It supports both built-in datasets and custom dataset implementations.
Key Features¶
- Dataset Family Registration: Datasets can be registered using register_dataset_family for seamless integration.
- Flexible Configuration: Dataset configurations can be defined through YAML asset cards.
Dataset Registration and Asset Cards¶
Datasets in fairseq2 can be registered as follows:
- Register Dataset Families:
Use register_dataset_family to register custom datasets:

from fairseq2.composition import register_dataset_family

register_dataset_family(
    container,            # DependencyContainer instance
    "custom_dataset",     # family name
    CustomDataset,        # dataset class
    CustomDatasetConfig,  # config class
    opener=custom_opener, # opener function
)
This function can be called either:
  - in Recipe.register() (read more in Building Recipes), or
  - in a fairseq2 extension function (read more in Runtime Extension).
- More Variants w/ Asset Cards:
Create YAML files describing your datasets in any of these locations:
  - Built-in cards: fairseq2/assets/cards/
  - System-wide cards: /etc/fairseq2/assets/ (overridden by FAIRSEQ2_ASSET_DIR if set)
  - User-specific cards: ~/.config/fairseq2/assets/ (overridden by FAIRSEQ2_USER_ASSET_DIR if set)
  - Recipe-local cards: Add to recipe config common section via extra_paths:
# Example: Adding recipe-local asset paths in config (absolute path)
common:
  assets:
    extra_paths: ["/path/to/assets"]

# or relative path
# e.g. if the common section is in /path/to/recipe/config.yaml
# it recursively retrieves assets from /path/to/recipe/assets
common:
  assets:
    extra_paths: ["${dir}/assets"]
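The lookup locations above can be sketched in plain Python. This is a minimal illustration, not fairseq2's own resolution code: the helper names `asset_dirs` and `expand_dir` are hypothetical, and the sketch assumes that each environment variable simply replaces its default directory and that `${dir}` expands to the directory containing the config file.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def asset_dirs(env: Optional[Mapping[str, str]] = None) -> dict[str, Path]:
    """Return the system-wide and user-specific asset directories.

    Hypothetical helper: assumes an environment variable, when set,
    replaces the corresponding default directory.
    """
    if env is None:
        env = os.environ

    system_dir = Path(env.get("FAIRSEQ2_ASSET_DIR", "/etc/fairseq2/assets"))
    user_dir = Path(
        env.get("FAIRSEQ2_USER_ASSET_DIR", "~/.config/fairseq2/assets")
    ).expanduser()

    return {"system": system_dir, "user": user_dir}


def expand_dir(path: str, config_file: Path) -> Path:
    """Expand ``${dir}`` to the directory that contains ``config_file``."""
    return Path(path.replace("${dir}", str(config_file.parent)))


if __name__ == "__main__":
    print(asset_dirs())
    print(expand_dir("${dir}/assets", Path("/path/to/recipe/config.yaml")))
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the helper easy to exercise without touching the real environment.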
Creating Custom Datasets¶
Custom datasets can be created independently in recipes by:
1. Implementing a dataset class
2. Creating a configuration class
3. Registering the dataset family
4. Providing asset card(s) in YAML format
Here’s a basic example:
from dataclasses import dataclass
from pathlib import Path
from typing import Any

from typing_extensions import override

from fairseq2.composition import register_dataset_family
from fairseq2.datasets import DataReader

# TrainRecipe and DependencyContainer must also be imported from fairseq2
# (see Building Recipes for the import paths).

# 1. Dataset Implementation
class CustomDataset:
    def create_reader(self, *args: Any, **kwargs: Any) -> DataReader:
        # Implementation (the arguments depend on your data pipeline)
        raise NotImplementedError

# 2. Configuration Class
@dataclass
class CustomDatasetConfig:
    """
    This configuration matches the keys after the top-level `dataset_config:` key
    in the YAML asset definition:

    ```yaml
    name: mydataset
    dataset_config:
      data: (all keys here must have a companion parameter in this config)
    ```
    """

    data: Path  # companion of the `data` key in the asset card
    # Other config options

# 3. Register in Recipe
class YourRecipe(TrainRecipe):  # or EvalRecipe/GenerationRecipe
    @override
    def register(self, container: DependencyContainer) -> None:
        register_dataset_family(
            container,
            "custom_dataset",            # family name
            CustomDataset,               # dataset class
            CustomDatasetConfig,         # config class
            opener=open_custom_dataset,  # opener function
        )
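The registration above passes `open_custom_dataset` as the opener without showing it. A minimal sketch follows, assuming an opener simply receives the config object and returns a dataset instance. The classes here are self-contained stand-ins rather than fairseq2 types, the config field is named `data` to match the key used in the asset card, and the dataset constructor argument is purely illustrative.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CustomDatasetConfig:
    """Stand-in config; `data` mirrors the `data` key of the asset card."""

    data: Path


class CustomDataset:
    """Stand-in dataset that only remembers where its data lives."""

    def __init__(self, data_dir: Path) -> None:
        self.data_dir = data_dir


def open_custom_dataset(config: CustomDatasetConfig) -> CustomDataset:
    """Basic opener: build the dataset from its config alone."""
    return CustomDataset(config.data)


if __name__ == "__main__":
    dataset = open_custom_dataset(CustomDatasetConfig(data=Path("/path/to/data")))
    print(dataset.data_dir)
```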
A dataset asset card for this family, in YAML format, could look like this:
name: mydataset
dataset_family: custom_dataset

---

name: mydataset@user
dataset_config:
  data: "/path/to/local/datasets/librilight/10h"

---

name: mydataset@mycluster
dataset_config:
  data: "/path/to/cluster/datasets/librilight/10h"
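To make the relationship between the card and the config class concrete, the sketch below models the three YAML documents as plain dictionaries, layers an environment-specific card such as `mydataset@user` over the base card, and checks that every key under `dataset_config` has a matching field in the config dataclass. It is a toy model under stated assumptions: fairseq2's real card loading and validation may differ, and `resolve_card` and `build_config` are hypothetical names.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Any, Optional


@dataclass
class CustomDatasetConfig:
    data: Path


# The three YAML documents of the card, written as plain dictionaries.
CARDS: dict[str, dict[str, Any]] = {
    "mydataset": {"dataset_family": "custom_dataset"},
    "mydataset@user": {
        "dataset_config": {"data": "/path/to/local/datasets/librilight/10h"}
    },
    "mydataset@mycluster": {
        "dataset_config": {"data": "/path/to/cluster/datasets/librilight/10h"}
    },
}


def resolve_card(name: str, env: Optional[str] = None) -> dict[str, Any]:
    """Layer the ``name@env`` card (if any) over the base ``name`` card."""
    card = dict(CARDS[name])
    if env is not None:
        card.update(CARDS.get(f"{name}@{env}", {}))
    return card


def build_config(card: dict[str, Any]) -> CustomDatasetConfig:
    """Build the config from ``dataset_config``, rejecting unknown keys."""
    raw = card.get("dataset_config", {})
    known = {f.name for f in fields(CustomDatasetConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"keys without a config field: {sorted(unknown)}")
    return CustomDatasetConfig(data=Path(raw["data"]))


if __name__ == "__main__":
    card = resolve_card("mydataset", env="user")
    print(card["dataset_family"], build_config(card).data)
```

Rejecting unknown keys early mirrors the rule stated in the config docstring: a key in the card that has no companion field is treated as an error rather than silently ignored.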
Advanced Dataset Opening¶
While the basic opener function is sufficient for most use cases, fairseq2 also provides an advanced opening mechanism through advanced_opener for cases where access to fairseq2’s dependency injection system is needed.
The key differences are:
- opener(config) -> Dataset: Simple function that takes only the config
- advanced_opener(resolver, config) -> Dataset: Takes a DependencyResolver as first parameter
The DependencyResolver provides access to other services and objects registered in fairseq2, making it useful for more complex dataset implementations that need to interact with other parts of the system.
Example usage:
from fairseq2.composition import register_dataset_family
from fairseq2.dependency import DependencyResolver

def my_advanced_opener(resolver: DependencyResolver, config: MyDatasetConfig) -> MyDataset:
    # Access other fairseq2 objects through the resolver
    some_object = resolver.get("object_name")

    # Use the object in dataset creation
    return MyDataset(config, some_object)

# Register with advanced opener
register_dataset_family(
    container,
    "my_dataset",
    MyDataset,
    MyDatasetConfig,
    advanced_opener=my_advanced_opener,  # Note: don't provide both opener and advanced_opener
)
Note
You must provide either opener or advanced_opener, but not both. For simple dataset implementations that don’t need access to other fairseq2 objects, using the basic opener is recommended.