In fairseq2, “assets” refer to the various components that make up a machine learning task, such as models, datasets, tokenizers, and other resources.
These assets are essential for training, evaluating, and deploying models.
The fairseq2.assets module provides a unified API to manage and load these different assets using “asset cards” from various “stores”.
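Conceptually, an asset store maps card names to metadata dictionaries collected from one or more sources. A toy sketch of that lookup idea (plain Python, not the real fairseq2 API):

```python
# Toy sketch of the asset-store concept: "cards" are named metadata
# dictionaries, and retrieval is a lookup by name. Illustrative only;
# the real fairseq2 store also handles sources, bases, and environments.

cards = {
    "llama3_8b": {"model_family": "llama", "model_arch": "llama3_8b"},
    "my_custom_dataset": {"dataset_family": "generic_instruction"},
}

def retrieve_card(name: str) -> dict:
    """Look up a card by name, mimicking an asset store's retrieve_card."""
    try:
        return cards[name]
    except KeyError:
        raise LookupError(f"Asset not found: {name}") from None

print(retrieve_card("llama3_8b")["model_family"])  # llama
```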
To add a custom dataset, create an asset card with the required fields:
```yaml
name: my_custom_dataset
dataset_family: generic_instruction
dataset_config:
  # free-form configuration for the dataset
```
Required Fields for Datasets:
name: Unique identifier for the dataset
dataset_family: The dataset loader family to use
dataset_config: Free-form configuration for the dataset loader
For dataset_config, you can use any configuration that the dataset loader supports.
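For instance, a card for a generic instruction dataset might pass a data path and a split through dataset_config. The keys below are illustrative, not a fixed schema; the accepted keys are defined by the dataset loader family:

```yaml
name: my_custom_dataset
dataset_family: generic_instruction
dataset_config:
  # Illustrative free-form options; check your loader family's config class.
  data: "/datasets/my_instructions.jsonl"
  split: "train"
```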
You can find an example of registering a custom dataset loader below.
Example: Registering a custom dataset loader
```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import final

from typing_extensions import override  # or `from typing import override` on Python 3.12+

from fairseq2 import DependencyContainer
from fairseq2.composition import register_dataset_family
from fairseq2.recipe import TrainRecipe


@final
class MyDataset:
    ...

    @classmethod
    def from_path(cls, path: Path) -> "MyDataset":
        ...


@dataclass
class MyDatasetConfig:
    """A dummy dataset config for demonstration purposes."""

    data: Path = field(default_factory=Path)


def open_my_dataset(config: MyDatasetConfig) -> MyDataset:
    """The mapping between the dataset asset card definition and MyDataset."""
    return MyDataset.from_path(config.data)


@final
class MyRecipe(TrainRecipe):
    """A dummy train recipe."""

    @override
    def register(self, container: DependencyContainer) -> None:
        register_dataset_family(
            container,
            "my_dataset",  # dataset family name
            MyDataset,
            MyDatasetConfig,
            opener=open_my_dataset,
        )
```
To add a custom model, you need both the architecture configuration and the asset card:
```yaml
name: my_custom_model@user
model_family: llama
model_arch: llama3_8b  # Use existing architecture
checkpoint: "/path/to/my/model.pt"
tokenizer: "hg://meta-llama/Llama-3-8b"
tokenizer_family: llama
```
Required Fields for Models:
name: Unique identifier for the model
model_family: The model family (e.g., ‘llama’, ‘qwen’, ‘mistral’)
Assets can have environment-specific configurations using the @environment syntax:
```yaml
# Base configuration
name: my_model
model_family: llama
model_arch: llama3_8b
checkpoint: "hg://meta-llama/Llama-3-8b"

---

# User-specific override
name: my_model@user
base: my_model
checkpoint: "/home/user/models/my_custom_llama.pt"

---

# Cluster-specific override
name: my_model@my_cluster
base: my_model
model_config:
  max_seq_len: 4096  # Shorter context for production
```
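The selection behavior can be pictured as "prefer the most specific card for the active environment." A toy sketch of that idea in plain Python (not fairseq2's actual resolver):

```python
# Toy sketch of environment-specific card selection: given a card name
# and the active environments, a "name@env" entry takes precedence over
# the plain "name" entry. Illustrative only.

cards = {
    "my_model": {"checkpoint": "hg://meta-llama/Llama-3-8b"},
    "my_model@user": {"checkpoint": "/home/user/models/my_custom_llama.pt"},
}

def resolve(name: str, environments: list[str]) -> dict:
    """Return the most specific card available for the active environments."""
    for env in environments:
        card = cards.get(f"{name}@{env}")
        if card is not None:
            return card
    return cards[name]  # fall back to the base configuration

print(resolve("my_model", ["user"])["checkpoint"])
# -> /home/user/models/my_custom_llama.pt
```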
For more detailed information about registering asset cards on various clusters, please see the Runtime Extension documentation.
Assets can inherit from other assets using the base field:
```yaml
# Base model configuration
name: base_model
model_family: qwen
model_arch: qwen25_7b
tokenizer_family: qwen

---

# Instruct version inheriting from base
name: base_model_instruct
base: base_model
checkpoint: "hg://qwen/qwen2.5-7b-instruct"
tokenizer: "hg://qwen/qwen2.5-7b-instruct"
tokenizer_config:
  use_im_end: true
```
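Inheritance can be thought of as a merge in which the derived card's fields win over the base card's. A toy sketch of that resolution (illustrative only, not fairseq2 internals):

```python
# Toy sketch of `base:` inheritance: a derived card starts from its
# base card's fields and overrides or adds its own.

cards = {
    "base_model": {
        "model_family": "qwen",
        "model_arch": "qwen25_7b",
        "tokenizer_family": "qwen",
    },
    "base_model_instruct": {
        "base": "base_model",
        "checkpoint": "hg://qwen/qwen2.5-7b-instruct",
    },
}

def flatten(name: str) -> dict:
    """Resolve a card, merging in its `base` chain (derived card wins)."""
    card = dict(cards[name])
    base_name = card.pop("base", None)
    if base_name is not None:
        merged = flatten(base_name)
        merged.update(card)
        return merged
    return card

resolved = flatten("base_model_instruct")
print(resolved["model_family"])  # qwen (inherited from base_model)
print(resolved["checkpoint"])    # hg://qwen/qwen2.5-7b-instruct (own field)
```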
Package assets live under a path such as `my_package/cards/` (if registered with `register_package_assets(container, "my_package.cards")`); since they are package resources, no environment variable applies (N/A).
If you are working with a recipe, you can also specify additional asset directories with the config override option `--config common.asset.extra_paths="['/path/to/assets/dir','/path/to/yet_other_assets/dir']"`.
You can register additional asset directories programmatically:
```python
from pathlib import Path

from fairseq2 import DependencyContainer, init_fairseq2
from fairseq2.composition import register_file_assets


def setup_my_fairseq2(container: DependencyContainer) -> None:
    register_file_assets(container, Path("/path/to/my/assets"))


init_fairseq2(extras=setup_my_fairseq2)

# Alternatively, register via a setuptools entry point,
# or via `Recipe.register()`.
```
For more detailed information about registering via setuptools, please see the Runtime Extension documentation.
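If your extension ships as a package, the setup function is typically exposed through an entry point in `pyproject.toml`. A sketch of that wiring; the group name below is a placeholder, so check the Runtime Extension documentation for the actual group that fairseq2 scans:

```toml
# pyproject.toml (sketch): expose the setup function so fairseq2 can
# discover it at init time. Group name is a placeholder.
[project.entry-points."fairseq2.extension"]
my_assets = "my_package.setup:setup_my_fairseq2"
```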
Dynamic Asset Creation:
```python
>>> from fairseq2 import DependencyContainer, init_fairseq2
>>> from fairseq2.assets import get_asset_store
>>> from fairseq2.composition import register_in_memory_assets
>>> entries = [{"name": "foo1", "model_family": "foo"}, {"name": "foo2", "model_family": "foo"}]
>>> def setup_fs2_extension(container: DependencyContainer) -> None:
...     register_in_memory_assets(container, source="my_in_mem_source", entries=entries)
...
>>> _ = init_fairseq2(extras=setup_fs2_extension)
>>> # Now you can load the asset
>>> asset_store = get_asset_store()
>>> asset_store.retrieve_card("foo1")
foo1={'model_family': 'foo', '__source__': 'my_in_mem_source'}
```
```yaml
name: custom_qwen
model_family: qwen
model_arch: qwen25_7b
checkpoint: "hg://qwen/qwen2.5-7b"
tokenizer: "hg://qwen/qwen2.5-7b"
tokenizer_family: qwen

# Override model configuration
model_config:
  max_seq_len: 8192  # Custom sequence length
  dropout_p: 0.1     # Add dropout for fine-tuning

# Override tokenizer configuration
tokenizer_config:
  use_im_end: true   # Use special end tokens
  max_length: 8192   # Match model sequence length
```
```python
# Check if asset exists
from fairseq2.assets import get_asset_store
from fairseq2.assets.store import AssetNotFoundError

asset_store = get_asset_store()

print("Available assets:", list(asset_store.asset_names))

# Try to load asset
try:
    card = asset_store.retrieve_card("my_asset")
    print(f"Found: {card.name}")
except AssetNotFoundError as e:
    print(f"Asset not found: {e}")
```
Path Issues:
```sh
# Check asset directories
echo "System: $FAIRSEQ2_ASSET_DIR"
echo "User: $FAIRSEQ2_USER_ASSET_DIR"
echo "Cache: $FAIRSEQ2_CACHE_DIR"

# List user asset directory
ls -la ~/.config/fairseq2/assets/
```