Building Recipes

fairseq2 v0.5 introduces a simplified approach to creating custom training recipes. This tutorial shows how to build a complete language model pretraining recipe with just a few focused classes, showcasing the power and flexibility of the new recipe system.

Overview

The new recipe system eliminates much of the complexity found in earlier versions, allowing you to focus on what matters most: your model, data, and training logic. A complete custom recipe consists of just four main components:

    flowchart LR
    %% Styling
    classDef recipeBox fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#01579b
    classDef configBox fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
    classDef datasetBox fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#1b5e20
    classDef entryBox fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#e65100

    R[Recipe Class<br/>LMTrainRecipe]:::recipeBox
    C[Configuration<br/>LMTrainConfig]:::configBox
    D[Dataset<br/>LMTrainDataset]:::datasetBox
    E[Entry Point<br/>__main__.py]:::entryBox

    E --> R
    R --> C
    R --> D
    C --> D

Key features of the new recipe system:

  • Minimal boilerplate: Just 3 methods to override in your recipe class

  • Automatic dependency injection: Components are wired together automatically

  • Type-safe configuration: Dataclass-based configs with IDE support

  • Pluggable datasets: Easy to swap data sources and formats

  • One-line execution: Single command to run your recipe

Let’s go through the language model pretraining recipe step by step.
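
With the module path used in the run commands at the end of this tutorial (``python -m recipes.lm.train``), one possible package layout looks like this; the package names are up to you, only the four files matter:

recipes/
└── lm/
    └── train/
        ├── __main__.py   # Step 0: entry point
        ├── config.py     # Step 1: LMTrainConfig
        ├── dataset.py    # Step 2: LMTrainDataset and its factory
        └── recipe.py     # Step 3: LMTrainRecipe and LMTrainUnit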

Step 0: Set Up the Entry Point

The entry point is remarkably simple in v0.5:

File: ``__main__.py``

from fairseq2.recipe.cli import train_main
from .recipe import LMTrainRecipe

# Create recipe instance
recipe = LMTrainRecipe()

# Run training with automatic CLI handling
train_main(recipe)

That’s it! A few lines of code are all it takes to create a complete training entry point.

What :func:`fairseq2.recipe.cli.train_main` provides automatically:

  • Command line argument parsing with recipe-specific options

  • Configuration loading and validation from files or command line

  • Distributed training setup with proper process group initialization

  • Logging configuration with structured output and metrics

  • Error handling with graceful shutdown and debugging support

  • Checkpoint management with automatic save/load functionality

Next, we need to build our LMTrainRecipe class. As a quick preview, here is its skeleton:

from typing import final

from typing_extensions import override

from fairseq2.recipe.base import RecipeContext, TrainRecipe

@final
class LMTrainRecipe(TrainRecipe):
    """Language model pretraining recipe."""

    @override
    def register(self, container: DependencyContainer) -> None:
        """Register dataset family with the dependency container."""
        register_dataset_family(...)

    @override
    def create_trainer(self, context: RecipeContext) -> Trainer:
        """Create the trainer with model and data configuration."""
        ...
        # TODO: build this config class for our recipe
        config = context.config.as_(LMTrainConfig)

        # TODO: build this Train unit class which defines loss computation
        unit = LMTrainUnit(context.model)

        # TODO: build the dataset and create data reader
        dataset = context.default_dataset.as_(LMTrainDataset)
        data_reader = dataset.create_reader(...)

        # Create trainer using context helper
        return context.create_trainer(unit, data_reader)

    @property
    @override
    def config_kls(self) -> type[object]:
        """Return the configuration class for this recipe."""
        return LMTrainConfig

So, let’s start with the first step.

Step 1: Define Your Configuration

Configuration in fairseq2 uses dataclasses with sensible defaults and clear structure:

File: ``config.py``

from dataclasses import dataclass, field

# (Imports of the fairseq2 section classes used below, such as ModelSection
# and TokenizerSection, are omitted for brevity.)

@dataclass(kw_only=True)
class LMTrainConfig:
    """Configuration for language model pretraining."""

    # Model configuration
    model: ModelSection = field(
        default_factory=lambda: ModelSection(...)
    )

    # Dataset configuration
    dataset: LMTrainDatasetSection = field(
        default_factory=lambda: LMTrainDatasetSection(...)
    )

    # Tokenizer selection
    tokenizer: TokenizerSection = field(
        default_factory=lambda: TokenizerSection(...)
    )

    # Distributed training setup
    gang: GangSection = field(default_factory=lambda: GangSection())

    # Training parameters
    trainer: TrainerSection = field(
        default_factory=lambda: TrainerSection(...)
    )

    # Optimizer configuration
    optimizer: OptimizerSection = field(
        default_factory=lambda: OptimizerSection(...)
    )

    # Learning rate scheduler
    lr_scheduler: LRSchedulerSection | None = field(
        default_factory=lambda: LRSchedulerSection(...)
    )

    # Training regime
    regime: RegimeSection = field(
        default_factory=lambda: RegimeSection(...)
    )

    # Common settings
    common: CommonSection = field(default_factory=lambda: CommonSection(...))

@dataclass(kw_only=True)
class LMTrainDatasetSection(DatasetSection):
    """Dataset-specific configuration parameters."""
    ...

This configuration approach gives you:

  • Simple Structure: Each section controls a specific aspect of training

  • Sensible Defaults: Ready-to-use settings for beginners

  • Type Safety: Full IDE support with autocompletion

  • Customizable: Easy to override values via command line or config files
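
To make this concrete, here is a minimal sketch of how the dotted overrides used in the run commands at the end of this tutorial line up with the nested dataclass fields. The field names (``model.name``, ``regime.num_steps``) are taken from those commands; the sketch assumes the sections are plain, mutable dataclasses:

# Dotted keys on the command line address nested fields of LMTrainConfig.
config = LMTrainConfig()

config.regime.num_steps = 20                # --config regime.num_steps=20
config.model.name = "llama3_2_1b_instruct"  # --config model.name=llama3_2_1b_instruct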

Step 2: Implement Your Dataset

The dataset component handles data loading and preprocessing:

File: ``dataset.py``

from collections.abc import Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, final

# (Imports of the fairseq2 data pipeline, gang, and tokenizer types used below
# are omitted for brevity.)

@final
class LMTrainDataset:
    """Language model training dataset supporting JSONL files."""

    def __init__(self, files: Sequence[Path]) -> None:
        self._files = files

    def create_reader(
        self,
        tokenizer: Tokenizer,
        gangs: Gangs,
        *,
        ...
    ) -> DataReader[SequenceBatch]:
        """Create a data reader for distributed training."""

        ...

        # Create data pipeline
        builder = read_sequence(self._files)

        # Shard files across ranks for distributed training
        if file_world_size > 1:
            builder.shard(file_rank, file_world_size, allow_uneven=True)

        # Define how to read individual files
        def read_file(file: Path) -> DataPipeline:
            ...

        builder.yield_from(read_file)

        ...

        # Packing for efficient training
        builder.pack(...)

        ...

        # Background prefetching for performance
        builder.prefetch(prefetch)

        # Convert to SequenceBatch format
        def to_batch(example: dict[str, Any]) -> SequenceBatch:
            seqs, seq_lens = example["seqs"], example["seq_lens"]
            return SequenceBatch(seqs, seq_lens, packed=True)

        pipeline = builder.map(to_batch).and_return()

        return DataPipelineReader[SequenceBatch](
            pipeline,
            gangs,
            ...
        )

@dataclass
class LMTrainDatasetConfig:
    """Configuration for LM training dataset."""
    path: Path = field(default_factory=Path)

def open_lm_train_dataset(config: LMTrainDatasetConfig) -> LMTrainDataset:
    """Factory function to create dataset from configuration."""
    path = config.path.expanduser().resolve()

    if not path.is_dir():
        # Single file
        files = [path]
    else:
        # Directory of JSONL files
        files = [f for f in path.glob("**/*.chunk.*.jsonl") if not f.is_dir()]
        files.sort()

    return LMTrainDataset(files)

Highlights of this dataset implementation:

  • Distributed by Design: Automatic file sharding across data parallel ranks

  • Efficient Packing: Sequences packed to maximize GPU utilization

  • Performance Optimized: Background prefetching and pinned memory

  • Flexible Input: Supports both single files and directories of files

  • torch.compile Ready: Proper BatchLayout configuration for compilation
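
As a quick sanity check, the factory above can be exercised on its own, outside of a full recipe run. The path below is only a placeholder; when it points to a directory, files matching ``**/*.chunk.*.jsonl`` are picked up:

from pathlib import Path

# Hypothetical corpus location; point this at a single JSONL file or at a
# directory containing *.chunk.*.jsonl files.
config = LMTrainDatasetConfig(path=Path("/data/my_corpus"))

dataset = open_lm_train_dataset(config)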

Step 3: Create Your Recipe Class

The recipe class ties everything together with minimal boilerplate:

File: ``recipe.py``

from typing import final

from torch import Tensor
from typing_extensions import override

from fairseq2.recipe.base import RecipeContext, TrainRecipe

# (The remaining fairseq2 imports, such as register_dataset_family, TrainUnit,
# SequenceBatch, and the metric helpers, are omitted for brevity.)

@final
class LMTrainRecipe(TrainRecipe):
    """Language model pretraining recipe."""

    @override
    def register(self, container: DependencyContainer) -> None:
        """Register dataset family with the dependency container."""
        register_dataset_family(
            container,
            LM_TRAIN_DATASET,           # Dataset type identifier
            LMTrainDataset,             # Dataset class
            LMTrainDatasetConfig,       # Configuration class
            opener=open_lm_train_dataset,  # Factory function
        )

    @override
    def create_trainer(self, context: RecipeContext) -> Trainer:
        """Create the trainer with model and data configuration."""
        ...
        # Get typed configuration
        config = context.config.as_(LMTrainConfig)

        # Create training unit (defines loss computation)
        unit = LMTrainUnit(context.model)

        # Get dataset and create data reader
        dataset = context.default_dataset.as_(LMTrainDataset)
        data_reader = dataset.create_reader(...)

        # Create trainer using context helper
        return context.create_trainer(unit, data_reader)

    @property
    @override
    def config_kls(self) -> type[object]:
        """Return the configuration class for this recipe."""
        return LMTrainConfig

@final
class LMTrainUnit(TrainUnit[SequenceBatch]):
    """Training unit that defines how to process batches."""

    def __init__(self, model: RecipeModel) -> None:
        self._model = model

    @override
    def process_batch(
        self,
        batch: SequenceBatch,
        metric_bag: MetricBag
    ) -> tuple[Tensor, None]:
        """Process a single batch and compute loss."""
        # Split batch into input and target sequences
        input_batch, target_batch = batch.as_auto_regressive()

        # Get sequences and layout for model input
        seqs, seqs_layout = input_batch.as_input()

        # Compute loss using the model
        nll_loss = self._model.module(
            seqs,
            seqs_layout,
            ...
        )

        # Update metrics
        update_nll_loss_metric(metric_bag, nll_loss)
        update_seq_batch_metrics(metric_bag, batch)

        return nll_loss, None
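
The ``as_auto_regressive()`` call implements the standard next-token objective: every position is trained to predict the token that follows it. Conceptually, ignoring the padding, packing, and layout details that SequenceBatch handles for you, the split looks like this:

import torch

tokens = torch.tensor([5, 9, 2, 7])

# The model reads all tokens but the last and is scored against all tokens but
# the first, so position i predicts token i + 1.
inputs, targets = tokens[:-1], tokens[1:]  # inputs=[5, 9, 2], targets=[9, 2, 7]
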

Key points about the recipe class:

  • Minimal Interface: Only 3 methods to override (register, create_trainer, config_kls)

  • Automatic Dependency Injection: Components are wired together by the framework

  • Type Safety: Strong typing throughout with IDE support

  • Flexible Training Logic: Easy to customize loss computation and metrics

  • Built-in Validation: Context provides validation helpers
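
One detail the listing above leaves implicit: LM_TRAIN_DATASET is a module-level constant that names your dataset family when it is registered. Its value is an arbitrary identifier of your choosing, for example:

# Hypothetical value; any string that identifies this dataset family works.
LM_TRAIN_DATASET = "lm_train"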

Running Your Recipe

Once you’ve created these four files, running your recipe is straightforward:

Basic Usage:

# Run with default configuration
python -m recipes.lm.train /output/dir

# Check the default configuration (YAML format)
python -m recipes.lm.train --dump-config

# Override the configuration with your own YAML file + command-line overrides
python -m recipes.lm.train /output/dir \
    --config-file /path/to/config.yaml \
    --config model.name=llama3_2_1b_instruct regime.num_steps=20 lr_scheduler.config.num_warmup_steps=10

See Also