fairseq2.datasets.hub
The dataset hub provides a centralized way to manage and access datasets in fairseq2. It offers functionality for discovering available datasets, loading datasets, and working with custom dataset configurations.
Core Classes
DatasetHub
final class fairseq2.datasets.hub.DatasetHub(family, asset_store)
Bases: Generic[DatasetT, DatasetConfigT]
The main hub class for managing datasets. Provides methods for:
- Listing available datasets (iter_cards())
- Loading dataset configurations (get_dataset_config())
- Opening datasets (open_dataset())
- Opening custom datasets (open_custom_dataset())
Example usage:
    from fairseq2.datasets.hub import DatasetHubAccessor

    # MY_DATA_FAMILY_NAME, MyDataset and MyDatasetConfig are placeholders
    # for your own dataset family name and classes.
    get_my_dataset_hub = DatasetHubAccessor(
        MY_DATA_FAMILY_NAME, kls=MyDataset, config_kls=MyDatasetConfig
    )

    # Get the dataset hub
    hub = get_my_dataset_hub()

    # List all available datasets
    for card in hub.iter_cards():
        print(f"Found dataset: {card.name}")

    # Load a dataset configuration
    config = hub.get_dataset_config("my_dataset")

    # Open a dataset
    dataset = hub.open_dataset("my_dataset")

    # Open a custom dataset with specific configuration
    custom_dataset = hub.open_custom_dataset(config)
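To make the relationship between the four methods easier to picture, here is a minimal in-memory sketch. It is not fairseq2 code: `ToyCard`, `ToyConfig`, `ToyDataset`, `ToyDatasetNotKnownError` and `ToyHub` are hypothetical stand-ins, and the idea that opening by name amounts to "look up the configuration, then open it" is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class ToyCard:
    """Hypothetical stand-in for an asset card: a name, a family, raw config fields."""
    name: str
    family: str
    config: Dict[str, object] = field(default_factory=dict)


@dataclass
class ToyConfig:
    """Hypothetical dataset configuration."""
    path: str = ""
    max_items: int = 0


@dataclass
class ToyDataset:
    """Hypothetical dataset that simply remembers its configuration."""
    config: ToyConfig


class ToyDatasetNotKnownError(Exception):
    def __init__(self, name: str) -> None:
        super().__init__(f"'{name}' is not a known dataset.")
        self.name = name


class ToyHub:
    """In-memory imitation of the hub workflow; not the fairseq2 implementation."""

    def __init__(self, family: str, cards: List[ToyCard]) -> None:
        self._family = family
        self._cards = {card.name: card for card in cards}

    def iter_cards(self) -> Iterator[ToyCard]:
        # Only yield cards that belong to this hub's family.
        return (c for c in self._cards.values() if c.family == self._family)

    def get_dataset_config(self, name: str) -> ToyConfig:
        card = self._cards.get(name)
        if card is None or card.family != self._family:
            raise ToyDatasetNotKnownError(name)
        return ToyConfig(**card.config)

    def open_custom_dataset(self, config: ToyConfig) -> ToyDataset:
        return ToyDataset(config)

    def open_dataset(self, name: str) -> ToyDataset:
        # Assumption: opening by name = resolve the config, then open it.
        return self.open_custom_dataset(self.get_dataset_config(name))


hub = ToyHub(
    "toy_family",
    [
        ToyCard("a", "toy_family", {"path": "/data/a", "max_items": 10}),
        ToyCard("b", "other_family"),
    ],
)
names = [card.name for card in hub.iter_cards()]
dataset = hub.open_dataset("a")
```

In this sketch the card for `"b"` is skipped by `iter_cards()` because it belongs to a different family, and asking the hub for it by name raises the toy error rather than returning a dataset of the wrong kind.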
DatasetHubAccessor
final class fairseq2.datasets.hub.DatasetHubAccessor(family_name, kls, config_kls)
Bases: Generic[DatasetT, DatasetConfigT]
Factory class for creating DatasetHub instances for specific dataset families. Dataset implementors can use it to create hub accessors for their dataset families.
Example implementation of a dataset hub accessor:
    from fairseq2.datasets.hub import DatasetHubAccessor
    from my_dataset import MyDataset, MyDatasetConfig

    # Create a hub accessor for your dataset family
    get_my_dataset_hub = DatasetHubAccessor(
        "my_dataset_family",  # dataset family name
        MyDataset,            # concrete dataset class
        MyDatasetConfig,      # concrete dataset config class
    )
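The accessor is a callable factory: it captures the family name and the two classes once, and produces a hub each time it is called. A rough, self-contained sketch of that pattern follows. Everything in it (`SketchHub`, `SketchHubAccessor`, `SketchFamilyNotKnownError`, the `KNOWN_FAMILIES` set) is hypothetical, and deferring the family lookup to call time is an assumption rather than a statement about fairseq2 internals.

```python
from typing import Generic, Set, Type, TypeVar

DatasetT = TypeVar("DatasetT")
ConfigT = TypeVar("ConfigT")

# Hypothetical registry; stands in for whatever the real library consults.
KNOWN_FAMILIES: Set[str] = {"my_dataset_family"}


class SketchFamilyNotKnownError(Exception):
    def __init__(self, name: str) -> None:
        super().__init__(f"'{name}' is not a known dataset family.")
        self.name = name


class SketchHub(Generic[DatasetT, ConfigT]):
    """Toy hub that only knows how to build a dataset from a config."""

    def __init__(
        self, family_name: str, kls: Type[DatasetT], config_kls: Type[ConfigT]
    ) -> None:
        self.family_name = family_name
        self.kls = kls
        self.config_kls = config_kls

    def open_custom_dataset(self, config: ConfigT) -> DatasetT:
        # Reject configs of the wrong type before constructing the dataset.
        if not isinstance(config, self.config_kls):
            raise TypeError(f"expected {self.config_kls.__name__}")
        return self.kls(config)


class SketchHubAccessor(Generic[DatasetT, ConfigT]):
    """Toy factory: stores the family description, builds a hub on call."""

    def __init__(
        self, family_name: str, kls: Type[DatasetT], config_kls: Type[ConfigT]
    ) -> None:
        self._family_name = family_name
        self._kls = kls
        self._config_kls = config_kls

    def __call__(self) -> "SketchHub[DatasetT, ConfigT]":
        # Assumption: the family is resolved here, not in __init__.
        if self._family_name not in KNOWN_FAMILIES:
            raise SketchFamilyNotKnownError(self._family_name)
        return SketchHub(self._family_name, self._kls, self._config_kls)


class MyDatasetConfig:
    def __init__(self, path: str = "") -> None:
        self.path = path


class MyDataset:
    def __init__(self, config: MyDatasetConfig) -> None:
        self.config = config


get_my_dataset_hub = SketchHubAccessor("my_dataset_family", MyDataset, MyDatasetConfig)
hub = get_my_dataset_hub()
dataset = hub.open_custom_dataset(MyDatasetConfig(path="/tmp/data"))
```

Keeping the accessor as a module-level callable means importing the module does no lookup work; the family is only resolved when a caller actually asks for the hub.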
Exceptions
DatasetNotKnownError
exception fairseq2.datasets.hub.DatasetNotKnownError(name)
Bases: Exception
Raised when attempting to open a dataset that is not registered in the asset store.
Example:
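    from fairseq2.datasets.hub import DatasetNotKnownError

    # `hub` is a DatasetHub obtained from a hub accessor.
    try:
        dataset = hub.open_dataset("non_existent_dataset")
    except DatasetNotKnownError as e:
        print(f"Dataset not found: {e.name}")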
DatasetFamilyNotKnownError
exception fairseq2.datasets.hub.DatasetFamilyNotKnownError(name)
Bases: Exception
Raised when attempting to access a dataset family that is not registered in the system.
Example:
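    from fairseq2.datasets.hub import DatasetFamilyNotKnownError, DatasetHubAccessor

    # MyDataset and MyDatasetConfig are placeholders for your own classes.
    try:
        hub = DatasetHubAccessor("unknown_family", MyDataset, MyDatasetConfig)()
    except DatasetFamilyNotKnownError as e:
        print(f"Dataset family not found: {e.name}")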
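Both exceptions take a `name` argument and the examples read it back as `e.name`, so a caller can handle the two failure modes in one place. The sketch below shows that pattern with stand-in exception classes so that it runs without fairseq2 installed. The `Sketch...` classes, `open_or_report` and `fake_open` are hypothetical, and the message strings are illustrative rather than the library's own.

```python
from typing import Callable, Optional, Tuple


class SketchDatasetNotKnownError(Exception):
    """Stand-in for an error that records the unknown dataset name."""

    def __init__(self, name: str) -> None:
        super().__init__(f"'{name}' is not a known dataset.")
        self.name = name


class SketchDatasetFamilyNotKnownError(Exception):
    """Stand-in for an error that records the unknown family name."""

    def __init__(self, name: str) -> None:
        super().__init__(f"'{name}' is not a known dataset family.")
        self.name = name


def open_or_report(
    opener: Callable[[str], object], name: str
) -> Tuple[Optional[object], Optional[str]]:
    """Call opener(name); return (dataset, None) or (None, error message)."""
    try:
        return opener(name), None
    except SketchDatasetNotKnownError as e:
        return None, f"Dataset not found: {e.name}"
    except SketchDatasetFamilyNotKnownError as e:
        return None, f"Dataset family not found: {e.name}"


def fake_open(name: str) -> object:
    # Hypothetical opener with exactly one registered dataset.
    if name != "my_dataset":
        raise SketchDatasetNotKnownError(name)
    return {"name": name}


ok_dataset, ok_error = open_or_report(fake_open, "my_dataset")
missing_dataset, missing_error = open_or_report(fake_open, "non_existent_dataset")
```

Returning the message instead of printing it keeps the helper usable from both scripts and tests; with the real library the `except` clauses would name `DatasetNotKnownError` and `DatasetFamilyNotKnownError` instead.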
See Also
fairseq2.data.tokenizers.hub for tokenizer hub reference documentation.
fairseq2.models.hub for model hub reference documentation.