fairseq2 v0.5 introduces a simplified approach to creating custom training recipes.
This tutorial shows how to build a complete language model pretraining recipe with just a few focused classes, showcasing the power and flexibility of the new recipe system.
The new recipe system eliminates much of the complexity found in earlier versions, allowing you to focus on what matters most: your model, data, and training logic.
A complete custom recipe consists of just four main components:
.. mermaid::

    flowchart LR
        %% Styling
        classDef recipeBox fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#01579b
        classDef configBox fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
        classDef datasetBox fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
        classDef entryBox fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100

        R[Recipe Class<br/>LMTrainRecipe]:::recipeBox
        C[Configuration<br/>LMTrainConfig]:::configBox
        D[Dataset<br/>LMTrainDataset]:::datasetBox
        E[Entry Point<br/>__main__.py]:::entryBox

        E --> R
        R --> C
        R --> D
        C --> D
- **Minimal boilerplate**: Just 3 methods to override in your recipe class
- **Automatic dependency injection**: Components are wired together automatically
- **Type-safe configuration**: Dataclass-based configs with IDE support
- **Pluggable datasets**: Easy to swap data sources and formats
- **One-line execution**: Single command to run your recipe
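Concretely, the four components map onto a small Python package. Below is one possible layout; the module path mirrors the ``recipes.lm.train`` commands used later in this tutorial, but the exact structure is up to you:

.. code-block:: text

    recipes/
    └── lm/
        └── train/
            ├── __main__.py   # entry point that calls train_main()
            ├── recipe.py     # LMTrainRecipe and LMTrainUnit
            ├── config.py     # LMTrainConfig and its sections
            └── dataset.py    # LMTrainDataset, LMTrainDatasetConfig, open_lm_train_dataset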
Let’s go through the language model pretraining recipe step by step.
File: ``__main__.py``

.. code-block:: python

    from fairseq2.recipe.cli import train_main

    from .recipe import LMTrainRecipe

    # Create recipe instance
    recipe = LMTrainRecipe()

    # Run training with automatic CLI handling
    train_main(recipe)
That’s it! A few lines of code are enough to create a complete training entry point.
What :func:`fairseq2.recipe.cli.train_main` provides automatically:
- Command-line argument parsing with recipe-specific options
- Configuration loading and validation from files or the command line
- Distributed training setup with proper process group initialization (see the launch sketch after this list)
- Logging configuration with structured output and metrics
- Error handling with graceful shutdown and debugging support
- Checkpoint management with automatic save/load functionality
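Because the distributed setup is handled for you, the same entry point can be launched with a standard multi-process launcher. The snippet below is an illustrative ``torchrun`` invocation, assuming the process-group settings are picked up from the environment variables the launcher sets:

.. code-block:: bash

    # Launch the same module on 8 local GPUs; RANK, WORLD_SIZE, etc. are set by torchrun.
    torchrun --nproc-per-node=8 -m recipes.lm.train /output/dir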
Next, we need to build our ``LMTrainRecipe`` class.
As a quick preview, here is the skeleton of the recipe class:
.. code-block:: python

    from fairseq2.recipe.base import RecipeContext, TrainRecipe


    @final
    class LMTrainRecipe(TrainRecipe):
        """Language model pretraining recipe."""

        @override
        def register(self, container: DependencyContainer) -> None:
            """Register dataset family with the dependency container."""
            register_dataset_family(...)

        @override
        def create_trainer(self, context: RecipeContext) -> Trainer:
            """Create the trainer with model and data configuration."""
            ...

            # TODO: build this config class for our recipe
            config = context.config.as_(LMTrainConfig)

            # TODO: build this train unit class, which defines loss computation
            unit = LMTrainUnit(context.model)

            # TODO: build the dataset and create the data reader
            dataset = context.default_dataset.as_(LMTrainDataset)

            data_reader = dataset.create_reader(...)

            # Create trainer using context helper
            return context.create_trainer(unit, data_reader)

        @property
        @override
        def config_kls(self) -> type[object]:
            """Return the configuration class for this recipe."""
            return LMTrainConfig
Configuration in fairseq2 uses dataclasses with sensible defaults and clear structure:
File: ``config.py``
.. code-block:: python

    @dataclass(kw_only=True)
    class LMTrainConfig:
        """Configuration for language model pretraining."""

        # Model configuration
        model: ModelSection = field(default_factory=lambda: ModelSection(...))

        # Dataset configuration
        dataset: LMTrainDatasetSection = field(default_factory=lambda: LMTrainDatasetSection(...))

        # Tokenizer selection
        tokenizer: TokenizerSection = field(default_factory=lambda: TokenizerSection(...))

        # Distributed training setup
        gang: GangSection = field(default_factory=lambda: GangSection())

        # Training parameters
        trainer: TrainerSection = field(default_factory=lambda: TrainerSection(...))

        # Optimizer configuration
        optimizer: OptimizerSection = field(default_factory=lambda: OptimizerSection(...))

        # Learning rate scheduler
        lr_scheduler: LRSchedulerSection | None = field(default_factory=lambda: LRSchedulerSection(...))

        # Training regime
        regime: RegimeSection = field(default_factory=lambda: RegimeSection(...))

        # Common settings
        common: CommonSection = field(default_factory=lambda: CommonSection(...))


    @dataclass(kw_only=True)
    class LMTrainDatasetSection(DatasetSection):
        """Dataset-specific configuration parameters."""

        ...
- **Simple Structure**: Each section controls a specific aspect of training
- **Sensible Defaults**: Ready-to-use settings for beginners
- **Type Safety**: Full IDE support with autocompletion
- **Customizable**: Easy to override values via the command line or config files
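To make the section layout concrete, here is a hypothetical excerpt of an override file; the top-level keys mirror the sections of ``LMTrainConfig``, and the values shown are only examples:

.. code-block:: yaml

    # Hypothetical values; run --dump-config to see the full set of keys.
    model:
      name: llama3_2_1b_instruct
    regime:
      num_steps: 20
    lr_scheduler:
      config:
        num_warmup_steps: 10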
The dataset component handles data loading and preprocessing:
File: ``dataset.py``
.. code-block:: python

    @final
    class LMTrainDataset:
        """Language model training dataset supporting JSONL files."""

        def __init__(self, files: Sequence[Path]) -> None:
            self._files = files

        def create_reader(
            self,
            tokenizer: Tokenizer,
            gangs: Gangs,
            *,
            ...
        ) -> DataReader[SequenceBatch]:
            """Create a data reader for distributed training."""
            ...

            # Create data pipeline
            builder = read_sequence(self._files)

            # Shard files across ranks for distributed training
            if file_world_size > 1:
                builder.shard(file_rank, file_world_size, allow_uneven=True)

            # Define how to read individual files
            def read_file(file: Path) -> DataPipeline:
                ...

            builder.yield_from(read_file)

            ...

            # Packing for efficient training
            builder.pack(...)

            ...

            # Background prefetching for performance
            builder.prefetch(prefetch)

            # Convert to SequenceBatch format
            def to_batch(example: dict[str, Any]) -> SequenceBatch:
                seqs, seq_lens = example["seqs"], example["seq_lens"]

                return SequenceBatch(seqs, seq_lens, packed=True)

            pipeline = builder.map(to_batch).and_return()

            return DataPipelineReader[SequenceBatch](pipeline, gangs, ...)


    @dataclass
    class LMTrainDatasetConfig:
        """Configuration for LM training dataset."""

        path: Path = field(default_factory=Path)


    def open_lm_train_dataset(config: LMTrainDatasetConfig) -> LMTrainDataset:
        """Factory function to create dataset from configuration."""
        path = config.path.expanduser().resolve()

        if not path.is_dir():
            # Single file
            files = [path]
        else:
            # Directory of JSONL files
            files = [f for f in path.glob("**/*.chunk.*.jsonl") if not f.is_dir()]

            files.sort()

        return LMTrainDataset(files)
- **Distributed by Design**: Automatic file sharding across data parallel ranks
- **Efficient Packing**: Sequences are packed to maximize GPU utilization
- **Performance Optimized**: Background prefetching and pinned memory
- **Flexible Input**: Supports both single files and directories of files
- **torch.compile Ready**: Proper ``BatchLayout`` configuration for compilation
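For reference, the factory above can also be exercised directly. A minimal sketch (the path is illustrative; during an actual recipe run, fairseq2 constructs the dataset for you through the registered dataset family):

.. code-block:: python

    from pathlib import Path

    # Point the config at a single JSONL file or a directory of *.chunk.*.jsonl files.
    config = LMTrainDatasetConfig(path=Path("~/data/my_corpus"))

    dataset = open_lm_train_dataset(config)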
The recipe class ties everything together with minimal boilerplate:
File: ``recipe.py``
.. code-block:: python

    @final
    class LMTrainRecipe(TrainRecipe):
        """Language model pretraining recipe."""

        @override
        def register(self, container: DependencyContainer) -> None:
            """Register dataset family with the dependency container."""
            register_dataset_family(
                container,
                LM_TRAIN_DATASET,  # Dataset type identifier
                LMTrainDataset,  # Dataset class
                LMTrainDatasetConfig,  # Configuration class
                opener=open_lm_train_dataset,  # Factory function
            )

        @override
        def create_trainer(self, context: RecipeContext) -> Trainer:
            """Create the trainer with model and data configuration."""
            ...

            # Get typed configuration
            config = context.config.as_(LMTrainConfig)

            # Create training unit (defines loss computation)
            unit = LMTrainUnit(context.model)

            # Get dataset and create data reader
            dataset = context.default_dataset.as_(LMTrainDataset)

            data_reader = dataset.create_reader(...)

            # Create trainer using context helper
            return context.create_trainer(unit, data_reader)

        @property
        @override
        def config_kls(self) -> type[object]:
            """Return the configuration class for this recipe."""
            return LMTrainConfig


    @final
    class LMTrainUnit(TrainUnit[SequenceBatch]):
        """Training unit that defines how to process batches."""

        def __init__(self, model: RecipeModel) -> None:
            self._model = model

        @override
        def process_batch(
            self, batch: SequenceBatch, metric_bag: MetricBag
        ) -> tuple[Tensor, None]:
            """Process a single batch and compute loss."""
            # Split batch into input and target sequences
            input_batch, target_batch = batch.as_auto_regressive()

            # Get sequences and layout for model input
            seqs, seqs_layout = input_batch.as_input()

            # Compute loss using the model
            nll_loss = self._model.module(seqs, seqs_layout, ...)

            # Update metrics
            update_nll_loss_metric(metric_bag, nll_loss)
            update_seq_batch_metrics(metric_bag, batch)

            return nll_loss, None
- **Minimal Interface**: Only 3 methods to override (``register``, ``create_trainer``, ``config_kls``)
- **Automatic Dependency Injection**: Components are wired together by the framework
- **Type Safety**: Strong typing throughout with IDE support
- **Flexible Training Logic**: Easy to customize loss computation and metrics
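If you have not seen the input/target split before, the following standalone sketch (plain PyTorch, not the fairseq2 API) illustrates the idea behind ``as_auto_regressive()`` and the NLL loss computed in ``process_batch``:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    tokens = torch.tensor([[5, 11, 42, 7, 2]])  # one token sequence

    input_seqs = tokens[:, :-1]   # model reads t_0 .. t_{n-1}
    target_seqs = tokens[:, 1:]   # and predicts t_1 .. t_n

    vocab_size = 100
    logits = torch.randn(1, input_seqs.size(1), vocab_size)  # stand-in for model output

    # Negative log-likelihood of the shifted targets
    nll_loss = F.cross_entropy(logits.reshape(-1, vocab_size), target_seqs.reshape(-1))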
Once you’ve created these four files, running your recipe is straightforward:
Basic Usage:
.. code-block:: bash

    # Run with default configuration
    python -m recipes.lm.train /output/dir

    # Check the default configuration (yaml format)
    python -m recipes.lm.train --dump-config

    # Override configuration with your own yaml file + config overrides
    python -m your_package.lm.train \
        --config-file /path/to/config.yaml \
        --config model.name=llama3_2_1b_instruct regime.num_steps=20 lr_scheduler.config.num_warmup_steps=10
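A common workflow, assuming ``--dump-config`` writes the YAML to standard output, is to save the defaults to a file, edit them, and pass the result back with ``--config-file``:

.. code-block:: bash

    # Save the default configuration, edit it, then train with it
    python -m recipes.lm.train --dump-config > my_config.yaml
    python -m recipes.lm.train --config-file my_config.yaml /output/dir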