fairseq2.datasets

The fairseq2.datasets module supports creating and managing datasets for a range of tasks, covering both built-in datasets and custom dataset implementations.

Key Features

  • Dataset Family Registration: Datasets can be registered using register_dataset_family for seamless integration

  • Flexible Configuration: Dataset configurations can be defined through YAML asset cards

Dataset Registration and Asset Cards

Datasets in fairseq2 can be registered as follows:

  1. Register Dataset Families:

    Use register_dataset_family to register custom datasets:

    from fairseq2.composition import register_dataset_family
    
    register_dataset_family(
        container,             # DependencyContainer instance
        "custom_dataset",      # family name
        CustomDataset,         # dataset class
        CustomDatasetConfig,   # config class
        opener=custom_opener   # opener function
    )
    

    This function can be called, for example, from a recipe's register method (see Creating Custom Datasets).

  2. Define Dataset Variants with Asset Cards:

    Create YAML files describing your datasets in any of these locations:

    • Built-in cards: fairseq2/assets/cards/

    • System-wide cards: /etc/fairseq2/assets/ (overridden by FAIRSEQ2_ASSET_DIR if set)

    • User-specific cards: ~/.config/fairseq2/assets/ (overridden by FAIRSEQ2_USER_ASSET_DIR if set)

    • Recipe-local cards: add directories to the common section of the recipe config via extra_paths:

    # Example: Adding recipe-local asset paths in config (absolute path)
    common:
        assets:
            extra_paths: ["/path/to/assets"]
    
    # Or a relative path. For example, if this common section lives in
    # /path/to/recipe/config.yaml, assets are retrieved recursively from
    # /path/to/recipe/assets.
    common:
        assets:
            extra_paths: ["${dir}/assets"]
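
The lookup rules above can be sketched in plain Python. This is only an illustration of the described behavior, not fairseq2's actual resolution code: the function name asset_search_dirs and its parameters are hypothetical, the built-in card directory is left out, and the order of the returned list does not imply any priority.

```python
import os
from pathlib import Path


def asset_search_dirs(config_file: Path, extra_paths: list[str], env: dict[str, str]) -> list[Path]:
    """Return candidate asset directories (hypothetical helper, not a fairseq2 API)."""
    # The environment variables replace the default system and user locations.
    system_dir = Path(env.get("FAIRSEQ2_ASSET_DIR", "/etc/fairseq2/assets"))
    user_default = os.path.expanduser("~/.config/fairseq2/assets")
    user_dir = Path(env.get("FAIRSEQ2_USER_ASSET_DIR", user_default))

    # ${dir} stands for the directory that contains the recipe config file.
    config_dir = config_file.parent.as_posix()
    recipe_dirs = [Path(p.replace("${dir}", config_dir)) for p in extra_paths]

    return [system_dir, user_dir, *recipe_dirs]


dirs = asset_search_dirs(
    Path("/path/to/recipe/config.yaml"),
    ["${dir}/assets"],
    {"FAIRSEQ2_ASSET_DIR": "/opt/assets"},
)
print(dirs[0].as_posix())   # /opt/assets
print(dirs[-1].as_posix())  # /path/to/recipe/assets
```

Passing the environment as a dictionary rather than reading os.environ directly keeps the sketch easy to test.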
    

Creating Custom Datasets

Custom datasets can be created independently in recipes by:

  1. Implementing a dataset class

  2. Creating a configuration class

  3. Registering the dataset family

  4. Providing asset card(s) in YAML format

Here’s a basic example:

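from dataclasses import dataclass
from pathlib import Path

from typing_extensions import override

from fairseq2.composition import register_dataset_family
from fairseq2.datasets import DataReader

# 1. Dataset Implementation
class CustomDataset:
    # Accept whatever arguments your recipe needs to build a reader.
    def create_reader(self, *args, **kwargs) -> DataReader:
        # Implementation
        raise NotImplementedError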

# 2. Configuration Class
@dataclass
class CustomDatasetConfig:
    """
    This configuration matches the keys after the top-level `dataset_config:` key
    in the YAML asset definition:

    ```yaml
    name: mydataset
    dataset_config:
        data: (all keys here must have a companion parameter in this config)
    ```
    """
    data: Path  # matches the `data` key under `dataset_config:`
    # Other config options

# 3. Register in Recipe
class YourRecipe(TrainRecipe):  # or EvalRecipe/GenerationRecipe
    @override
    def register(self, container: DependencyContainer) -> None:
        register_dataset_family(
            container,
            "custom_dataset",           # family name
            CustomDataset,              # dataset class
            CustomDatasetConfig,        # config class
            opener=open_custom_dataset  # opener function
        )
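
The opener passed to register_dataset_family receives the parsed config and returns the dataset. Below is a minimal sketch of what open_custom_dataset might look like. The stand-in CustomDataset and CustomDatasetConfig classes, the files attribute, and the directory check are illustrative assumptions, not part of fairseq2.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CustomDatasetConfig:
    data: Path  # mirrors the `data` key under `dataset_config:` in the asset card


class CustomDataset:
    def __init__(self, files: list[Path]) -> None:
        self.files = files


def open_custom_dataset(config: CustomDatasetConfig) -> CustomDataset:
    """Hypothetical opener: builds the dataset from its config alone."""
    data_dir = Path(config.data)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {data_dir}")
    # Collect the file list eagerly; a real dataset would likely defer
    # reading the data itself until create_reader() is called.
    return CustomDataset(sorted(p for p in data_dir.iterdir() if p.is_file()))


# Usage with a throwaway directory holding a single file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample.txt").write_text("hello")
    dataset = open_custom_dataset(CustomDatasetConfig(data=Path(tmp)))
    print(len(dataset.files))  # 1
```

Failing early on a missing directory surfaces a misconfigured asset card when the dataset is opened rather than when the first batch is read.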

A corresponding dataset asset card in YAML might look like this:

name: mydataset
dataset_family: custom_dataset

---

name: mydataset@user
dataset_config:
    data: "/path/to/local/datasets/librilight/10h"

---

name: mydataset@mycluster
dataset_config:
    data: "/path/to/cluster/datasets/librilight/10h"
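
To illustrate how the three documents relate, the sketch below models them as plain dictionaries. It assumes that the @user and @mycluster suffixes mark environment-specific overrides of the base card, as the card layout suggests. The resolve_card helper and the way the environment name is passed in are hypothetical and do not reflect fairseq2's actual asset resolution.

```python
from dataclasses import dataclass
from pathlib import Path

# The three YAML documents of the card, written as plain dicts.
cards = {
    "mydataset": {"dataset_family": "custom_dataset"},
    "mydataset@user": {
        "dataset_config": {"data": "/path/to/local/datasets/librilight/10h"},
    },
    "mydataset@mycluster": {
        "dataset_config": {"data": "/path/to/cluster/datasets/librilight/10h"},
    },
}


@dataclass
class CustomDatasetConfig:
    data: Path


def resolve_card(name: str, env: str) -> dict:
    """Overlay the environment-specific card on the base card (toy model)."""
    merged = dict(cards[name])
    merged.update(cards.get(f"{name}@{env}", {}))
    return merged


card = resolve_card("mydataset", "mycluster")
# Every key under dataset_config must match a field of the config class.
config = CustomDatasetConfig(**{k: Path(v) for k, v in card["dataset_config"].items()})
print(card["dataset_family"])  # custom_dataset
print(config.data.as_posix())  # /path/to/cluster/datasets/librilight/10h
```

In this toy model, a key under dataset_config that has no matching field makes the dataclass constructor raise a TypeError, which is one way to picture the requirement that every key has a companion parameter in the config class.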

Advanced Dataset Opening

The basic opener function is sufficient for most use cases. For datasets that need access to fairseq2’s dependency injection system, fairseq2 also accepts an advanced_opener.

The key differences are:

  • opener(config) -> Dataset: Simple function that takes only the config

  • advanced_opener(resolver, config) -> Dataset: Takes a DependencyResolver as first parameter

The DependencyResolver provides access to other services and objects registered in fairseq2, making it useful for more complex dataset implementations that need to interact with other parts of the system.

Example usage:

from fairseq2.dependency import DependencyResolver

def my_advanced_opener(resolver: DependencyResolver, config: MyDatasetConfig) -> MyDataset:
    # Access other fairseq2 objects through the resolver
    some_object = resolver.get("object_name")

    # Use the object in dataset creation
    return MyDataset(config, some_object)

# Register with advanced opener
register_dataset_family(
    container,
    "my_dataset",
    MyDataset,
    MyDatasetConfig,
    advanced_opener=my_advanced_opener  # Note: don't provide both opener and advanced_opener
)

Note

You must provide either opener or advanced_opener, but not both. For simple dataset implementations that don’t need access to other fairseq2 objects, using the basic opener is recommended.
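
The either-or rule can be pictured with a small, self-contained sketch. The select_opener function below is hypothetical and is not how fairseq2 implements register_dataset_family. It only shows one way to validate the two arguments and to adapt a simple opener to the (resolver, config) form.

```python
from typing import Any, Callable, Optional

Opener = Callable[[Any], Any]
AdvancedOpener = Callable[[Any, Any], Any]


def select_opener(
    opener: Optional[Opener] = None,
    advanced_opener: Optional[AdvancedOpener] = None,
) -> AdvancedOpener:
    """Validate the arguments and return a uniform (resolver, config) callable."""
    # Exactly one of the two must be given: reject both and reject neither.
    if (opener is None) == (advanced_opener is None):
        raise ValueError("provide exactly one of `opener` or `advanced_opener`")

    if advanced_opener is not None:
        return advanced_opener

    # Adapt the simple opener; it has no use for the resolver.
    return lambda resolver, config: opener(config)


open_fn = select_opener(opener=lambda config: f"dataset({config})")
print(open_fn(None, "cfg"))  # dataset(cfg)
```

Normalizing both forms to a single signature means the calling code never has to branch on which kind of opener was registered.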