fairseq2.datasets.hub

The dataset hub provides a centralized way to manage and access datasets in fairseq2. It offers functionality for discovering available datasets, loading datasets, and working with custom dataset configurations.

Core Classes

DatasetHub

final class fairseq2.datasets.hub.DatasetHub(family: DatasetFamily, asset_store: AssetStore)[source]

Bases: Generic[DatasetT, DatasetConfigT]

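The main hub class for managing datasets. It provides methods for listing the asset cards of the available datasets, loading a dataset's configuration, opening a dataset by name or card, and opening a custom dataset from a configuration alone.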

Example usage:

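from fairseq2.datasets.hub import DatasetHubAccessor

# MY_DATA_FAMILY_NAME, MyDataset, and MyDatasetConfig are placeholders
# for your own dataset family name and classes.
get_my_dataset_hub = DatasetHubAccessor(
    MY_DATA_FAMILY_NAME, kls=MyDataset, config_kls=MyDatasetConfig
)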

# Get the dataset hub
hub = get_my_dataset_hub()

# List all available datasets
for card in hub.iter_cards():
    print(f"Found dataset: {card.name}")

# Load a dataset configuration
config = hub.get_dataset_config("my_dataset")

# Open a dataset
dataset = hub.open_dataset("my_dataset")

# Open a custom dataset with specific configuration
custom_dataset = hub.open_custom_dataset(config)

iter_cards() → Iterator[AssetCard][source]
get_dataset_config(card: AssetCard | str) → DatasetConfigT[source]
open_dataset(card: AssetCard | str, *, config: DatasetConfigT | None = None) → DatasetT[source]
open_custom_dataset(config: DatasetConfigT) → DatasetT[source]
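
To make the relationship between the four methods concrete, here is a minimal in-memory sketch that mirrors their signatures. It is not fairseq2 code: ToyCard, ToyConfig, ToyDataset, and ToyHub are hypothetical stand-ins, and the behavior shown (an explicit config overriding the registered one in open_dataset) is an assumption inferred from the signatures above.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyCard:
    """Hypothetical stand-in for an asset card; it only carries a name."""
    name: str


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical dataset configuration."""
    path: str
    max_seq_len: int = 512


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical dataset that just remembers its configuration."""
    config: ToyConfig


class ToyHub:
    """In-memory imitation of the four hub methods; not fairseq2 code."""

    def __init__(self, configs: dict[str, ToyConfig]) -> None:
        self._configs = configs

    def iter_cards(self) -> Iterator[ToyCard]:
        for name in self._configs:
            yield ToyCard(name)

    def get_dataset_config(self, card: ToyCard | str) -> ToyConfig:
        name = card if isinstance(card, str) else card.name
        return self._configs[name]

    def open_dataset(
        self, card: ToyCard | str, *, config: ToyConfig | None = None
    ) -> ToyDataset:
        # Assumption: an explicit config takes precedence over the registered one.
        if config is None:
            config = self.get_dataset_config(card)
        return ToyDataset(config)

    def open_custom_dataset(self, config: ToyConfig) -> ToyDataset:
        # No card lookup: the caller supplies the whole configuration.
        return ToyDataset(config)


hub = ToyHub({"my_dataset": ToyConfig(path="/data/my_dataset")})
names = [card.name for card in hub.iter_cards()]

# Load the registered configuration, tweak one field, and pass it back in.
base = hub.get_dataset_config("my_dataset")
shorter = replace(base, max_seq_len=128)
overridden = hub.open_dataset("my_dataset", config=shorter)

# Open a dataset that has no card at all.
custom = hub.open_custom_dataset(ToyConfig(path="/tmp/scratch"))
```

In this sketch, open_dataset starts from a registered card and open_custom_dataset starts from a configuration only, which is one plausible reading of why the hub exposes both.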

DatasetHubAccessor

final class fairseq2.datasets.hub.DatasetHubAccessor(family_name: str, kls: type[DatasetT], config_kls: type[DatasetConfigT])[source]

Bases: Generic[DatasetT, DatasetConfigT]

Factory class for creating DatasetHub instances for specific dataset families. Can be used by dataset implementors to create hub accessors for their dataset families.

Example implementation of a dataset hub accessor:

from fairseq2.datasets.hub import DatasetHubAccessor
from my_dataset import MyDataset, MyDatasetConfig

# Create a hub accessor for your dataset family
get_my_dataset_hub = DatasetHubAccessor(
    "my_dataset_family",  # dataset family name
    MyDataset,           # concrete dataset class
    MyDatasetConfig      # concrete dataset config class
)
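
The accessor is a callable object: it is constructed once with the family name and classes, then called to obtain a hub. The following self-contained sketch illustrates that pattern with hypothetical stand-ins (SketchHub, SketchHubAccessor, SketchFamilyNotKnownError, and the KNOWN_FAMILIES set). It assumes the family lookup is deferred until the accessor is called; it is not the library's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, TypeVar

DatasetT = TypeVar("DatasetT")
ConfigT = TypeVar("ConfigT")

# Hypothetical registry of known family names, standing in for whatever
# lookup a real system performs.
KNOWN_FAMILIES = {"my_dataset_family"}


class SketchFamilyNotKnownError(Exception):
    """Hypothetical error carrying the unknown family name."""

    def __init__(self, name: str) -> None:
        super().__init__(f"'{name}' is not a known dataset family.")
        self.name = name


@dataclass(frozen=True)
class SketchHub(Generic[DatasetT, ConfigT]):
    """Hypothetical hub that only records what it was created for."""
    family_name: str
    kls: type[DatasetT]
    config_kls: type[ConfigT]


class SketchHubAccessor(Generic[DatasetT, ConfigT]):
    """Callable factory: stores its arguments now, resolves the hub later."""

    def __init__(
        self, family_name: str, kls: type[DatasetT], config_kls: type[ConfigT]
    ) -> None:
        self._family_name = family_name
        self._kls = kls
        self._config_kls = config_kls

    def __call__(self) -> SketchHub[DatasetT, ConfigT]:
        # Assumption: the family is validated at call time, not at construction.
        if self._family_name not in KNOWN_FAMILIES:
            raise SketchFamilyNotKnownError(self._family_name)
        return SketchHub(self._family_name, self._kls, self._config_kls)


class MyDataset:
    """Placeholder dataset class."""


class MyDatasetConfig:
    """Placeholder dataset config class."""


# Module-level accessor, created once and called wherever a hub is needed.
get_my_dataset_hub = SketchHubAccessor(
    "my_dataset_family", MyDataset, MyDatasetConfig
)
hub = get_my_dataset_hub()
```

Deferring the lookup in this way keeps the module-level assignment cheap and free of side effects, so the accessor can be defined at import time.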

Exceptions

DatasetNotKnownError

exception fairseq2.datasets.hub.DatasetNotKnownError(name: str)[source]

Bases: Exception

Raised when attempting to open a dataset that is not registered in the asset store.

Example:

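from fairseq2.datasets.hub import DatasetNotKnownError

# `hub` is a DatasetHub obtained from a hub accessor.
try: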
    dataset = hub.open_dataset("non_existent_dataset")
except DatasetNotKnownError as e:
    print(f"Dataset not found: {e.name}")
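
When a dataset name is not found, it can help to suggest similar names that are registered. The helper below is a hypothetical convenience, not part of fairseq2; it only uses Python's standard difflib module and takes a plain list of names, which could be collected from hub.iter_cards() via each card's name attribute.

```python
from __future__ import annotations

import difflib


def suggest_dataset_names(
    requested: str, known_names: list[str], limit: int = 3
) -> list[str]:
    """Return up to `limit` known names that closely resemble `requested`."""
    # get_close_matches ranks candidates by similarity and drops those
    # scoring below the cutoff.
    return difflib.get_close_matches(requested, known_names, n=limit, cutoff=0.6)


# Hypothetical dataset names, for illustration only.
known = ["my_dataset", "my_dataset_small", "other_corpus"]
suggestions = suggest_dataset_names("my_datset", known)
```

Inside the except block, the suggestions could be appended to the error message shown to the user.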

DatasetFamilyNotKnownError

exception fairseq2.datasets.hub.DatasetFamilyNotKnownError(name: str)[source]

Bases: Exception

Raised when attempting to access a dataset family that is not registered in the system.

Example:

from fairseq2.datasets.hub import DatasetFamilyNotKnownError, DatasetHubAccessor

try:
    hub = DatasetHubAccessor("unknown_family", MyDataset, MyDatasetConfig)()
except DatasetFamilyNotKnownError as e:
    print(f"Dataset family not found: {e.name}")

See Also