fairseq2.datasets.hub

The dataset hub provides a centralized way to manage and access datasets in fairseq2. It lets you discover the available datasets, load their configurations, open datasets by name, and open custom datasets from a configuration.

Core Classes

DatasetHub

final class fairseq2.datasets.hub.DatasetHub(family, asset_store)

Bases: Generic[DatasetT, DatasetConfigT]

The main hub class for managing datasets. Provides methods for:

  • Listing available datasets (iter_cards())

  • Loading dataset configurations (get_dataset_config())

  • Opening datasets (open_dataset())

  • Opening custom datasets (open_custom_dataset())

Example usage:

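from fairseq2.datasets.hub import DatasetHubAccessor

# MY_DATA_FAMILY_NAME, MyDataset, and MyDatasetConfig are placeholders
# for your own dataset family name and classes.
get_my_dataset_hub = DatasetHubAccessor(
    MY_DATA_FAMILY_NAME, kls=MyDataset, config_kls=MyDatasetConfig
)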

# Get the dataset hub
hub = get_my_dataset_hub()

# List all available datasets
for card in hub.iter_cards():
    print(f"Found dataset: {card.name}")

# Load a dataset configuration
config = hub.get_dataset_config("my_dataset")

# Open a dataset
dataset = hub.open_dataset("my_dataset")

# Open a custom dataset with specific configuration
custom_dataset = hub.open_custom_dataset(config)
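
As a small illustration of working with iter_cards(), the sketch below collects the names of all cards matching a prefix. It relies only on iter_cards() and the card's name attribute shown above. The helper list_dataset_names and the StubHub class are hypothetical and not part of fairseq2; StubHub only stands in for a real hub so the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator, List


def list_dataset_names(hub: Any, prefix: str = "") -> List[str]:
    """Return the sorted names of all cards whose name starts with `prefix`."""
    return sorted(
        card.name for card in hub.iter_cards() if card.name.startswith(prefix)
    )


class StubHub:
    """Hypothetical stand-in for a dataset hub; not part of fairseq2."""

    def __init__(self, names: Iterable[str]) -> None:
        self._names = list(names)

    def iter_cards(self) -> Iterator[SimpleNamespace]:
        # Each yielded object only needs a `name` attribute for this sketch.
        for name in self._names:
            yield SimpleNamespace(name=name)


stub_hub = StubHub(["librispeech_asr", "my_dataset", "my_dataset_small"])
matching_names = list_dataset_names(stub_hub, prefix="my_")
```

With a real hub, the same helper would presumably be called as list_dataset_names(get_my_dataset_hub(), prefix="my_").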

DatasetHubAccessor

final class fairseq2.datasets.hub.DatasetHubAccessor(family_name, kls, config_kls)

Bases: Generic[DatasetT, DatasetConfigT]

Factory class for creating DatasetHub instances for a specific dataset family. Dataset implementors create one accessor per dataset family and call it to obtain the corresponding hub.

Example implementation of a dataset hub accessor:

from fairseq2.datasets.hub import DatasetHubAccessor
from my_dataset import MyDataset, MyDatasetConfig

# Create a hub accessor for your dataset family
get_my_dataset_hub = DatasetHubAccessor(
    "my_dataset_family",  # dataset family name
    MyDataset,           # concrete dataset class
    MyDatasetConfig      # concrete dataset config class
)

Exceptions

DatasetNotKnownError

exception fairseq2.datasets.hub.DatasetNotKnownError(name)

Bases: Exception

Raised when attempting to open a dataset that is not registered in the asset store.

Example:

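from fairseq2.datasets.hub import DatasetNotKnownError

# `hub` is a DatasetHub obtained from a hub accessor.
try:
    dataset = hub.open_dataset("non_existent_dataset")
except DatasetNotKnownError as e:
    print(f"Dataset not found: {e.name}")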
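
One way to use this exception is to try several candidate names in order and fall back when one is not registered. The sketch below is self-contained: its DatasetNotKnownError is a local stand-in that mirrors the documented constructor and name attribute, and StubHub and open_first_available are hypothetical helpers, not fairseq2 APIs. In real code the exception would be imported from fairseq2.datasets.hub.

```python
from typing import Any, Dict, List, Sequence


class DatasetNotKnownError(Exception):
    """Local stand-in mirroring the documented exception: stores `name`."""

    def __init__(self, name: str) -> None:
        super().__init__(f"'{name}' is not a known dataset.")
        self.name = name


class StubHub:
    """Hypothetical dict-backed hub used only to exercise the helper."""

    def __init__(self, datasets: Dict[str, Any]) -> None:
        self._datasets = datasets

    def open_dataset(self, name: str) -> Any:
        try:
            return self._datasets[name]
        except KeyError:
            raise DatasetNotKnownError(name) from None


def open_first_available(hub: Any, names: Sequence[str]) -> Any:
    """Open the first dataset in `names` that the hub knows about."""
    missing: List[str] = []
    for name in names:
        try:
            return hub.open_dataset(name)
        except DatasetNotKnownError as ex:
            # Remember the unknown name and try the next candidate.
            missing.append(ex.name)
    raise LookupError(f"None of these datasets are registered: {missing}")


stub_hub = StubHub({"my_dataset": ["sample_0", "sample_1"]})
dataset = open_first_available(stub_hub, ["my_dataset_v2", "my_dataset"])
```

Here "my_dataset_v2" is unknown to the stub, so the helper falls back to "my_dataset". If no candidate is registered, it raises LookupError listing every name it tried.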

DatasetFamilyNotKnownError

exception fairseq2.datasets.hub.DatasetFamilyNotKnownError(name)

Bases: Exception

Raised when attempting to access a dataset family that is not registered in the system.

Example:

from fairseq2.datasets.hub import DatasetFamilyNotKnownError, DatasetHubAccessor

try:
    hub = DatasetHubAccessor("unknown_family", MyDataset, MyDatasetConfig)()
except DatasetFamilyNotKnownError as e:
    print(f"Dataset family not found: {e.name}")
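
To make the accessor pattern above more concrete, here is a toy sketch of a callable factory that is configured once and raises a family-not-known error when invoked for an unregistered family. It is not fairseq2's implementation: ToyHub, ToyHubAccessor, ToyFamilyNotKnownError, and the known_families set are all hypothetical, and checking the family at call time rather than at construction time is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Set


class ToyFamilyNotKnownError(Exception):
    """Hypothetical stand-in; stores the unknown family name in `name`."""

    def __init__(self, name: str) -> None:
        super().__init__(f"'{name}' is not a known dataset family.")
        self.name = name


@dataclass(frozen=True)
class ToyHub:
    """Hypothetical hub that only remembers which family it serves."""

    family_name: str


class ToyHubAccessor:
    """Callable factory: configured once, invoked whenever a hub is needed."""

    def __init__(self, family_name: str, known_families: Set[str]) -> None:
        self._family_name = family_name
        self._known_families = known_families

    def __call__(self) -> ToyHub:
        # Assumption of this sketch: the family lookup happens at call time.
        if self._family_name not in self._known_families:
            raise ToyFamilyNotKnownError(self._family_name)
        return ToyHub(self._family_name)


known_families = {"my_dataset_family"}
get_my_hub = ToyHubAccessor("my_dataset_family", known_families)
get_unknown_hub = ToyHubAccessor("unknown_family", known_families)

hub = get_my_hub()

error_name: Optional[str] = None
try:
    get_unknown_hub()
except ToyFamilyNotKnownError as ex:
    error_name = ex.name
```

Keeping the accessor as a module-level callable lets callers write get_my_hub() wherever a hub is needed, without repeating the family name or the dataset classes.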

See Also