Recipes in fairseq2 use a structured configuration system based on dataclasses. Let’s examine the configuration structure using the instruction fine-tuning recipe as an example.
Specifies how training data is loaded, batched, and shuffled:

```yaml
dataset:
  name: foo                      # Dataset name
  family: instruction            # Dataset family
  path: /path/to/data            # Path to dataset
  train_split: default           # Training split name
  max_seq_len: 8192              # Maximum sequence length
  max_num_tokens: 16384          # Maximum tokens per batch
  batch_size: null               # Fixed batch size (if specified)
  example_shuffle_window: 10000  # Window size for example shuffling
  batch_shuffle_window: 1000     # Window size for batch shuffling
  num_prefetch: 4                # Number of batches to prefetch
```
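The two shuffle windows control how much of the stream is buffered for randomization: a larger window mixes examples (or batches) more thoroughly but holds more of them in memory. The sketch below illustrates the general shuffle-window idea with a fixed-size buffer; it is not fairseq2's actual data-pipeline implementation, and the function name is made up.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def windowed_shuffle(items: Iterable[T], window: int, seed: int = 0) -> Iterator[T]:
    """Shuffle a stream using a fixed-size buffer (the 'shuffle window')."""
    rng = random.Random(seed)
    buffer: list[T] = []
    for item in items:
        if len(buffer) < window:
            buffer.append(item)        # fill the window first
            continue
        idx = rng.randrange(window)    # emit a random buffered item...
        yield buffer[idx]
        buffer[idx] = item             # ...and replace it with the incoming one
    rng.shuffle(buffer)                # drain whatever is left
    yield from buffer

# Stream ten examples through a window of four
print(list(windowed_shuffle(range(10), window=4)))
```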
Specifies training behavior and hardware settings:
```yaml
trainer:
  dtype: bfloat16                 # Training data type
  data_parallelism: fsdp          # Data parallel strategy (ddp or fsdp)
  activation_checkpointing: true  # Use activation checkpointing
  gradient_accumulation: 1        # Gradient accumulation steps
  torch_compile: false            # Use torch.compile
  mixed_precision: static         # Mixed precision mode
  fsdp:                           # FSDP-specific settings
    granularity: layer
    reshard_after_forward: true
```
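Assuming `max_num_tokens` is a per-rank budget (as is typical for data-parallel training), the number of tokens consumed per optimizer step also depends on `gradient_accumulation` and the data-parallel world size. A back-of-the-envelope helper (a general rule of thumb, not a fairseq2 API):

```python
def tokens_per_optimizer_step(max_num_tokens: int,
                              gradient_accumulation: int,
                              world_size: int) -> int:
    """Approximate tokens consumed per optimizer step across all ranks."""
    return max_num_tokens * gradient_accumulation * world_size

# 16384 tokens per batch, no accumulation, 8 GPUs -> 131072 tokens per step
print(tokens_per_optimizer_step(16_384, 1, 8))
```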
Defines the training regime, i.e. how long to train and how often to validate, checkpoint, and publish metrics:

```yaml
regime:
  num_steps: 5000                    # Total training steps
  validate_every_n_steps: 100        # Validation frequency
  checkpoint_every_n_steps: 1000     # Checkpoint frequency
  keep_last_n_checkpoints: 1         # Number of checkpoints to keep
  publish_metrics_every_n_steps: 10  # Metrics logging frequency
```
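Since the whole recipe config is a dataclass, each of the YAML sections above maps to a nested dataclass field. The sketch below mirrors that structure for illustration only; the class names and defaults follow the YAML shown here rather than fairseq2's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetSection:
    name: str = "foo"
    family: str = "instruction"
    path: str = "/path/to/data"
    train_split: str = "default"
    max_seq_len: int = 8192
    max_num_tokens: int = 16384
    batch_size: Optional[int] = None
    example_shuffle_window: int = 10000
    batch_shuffle_window: int = 1000
    num_prefetch: int = 4

@dataclass
class FsdpSection:
    granularity: str = "layer"
    reshard_after_forward: bool = True

@dataclass
class TrainerSection:
    dtype: str = "bfloat16"
    data_parallelism: str = "fsdp"      # "ddp" or "fsdp"
    activation_checkpointing: bool = True
    gradient_accumulation: int = 1
    torch_compile: bool = False
    mixed_precision: str = "static"
    fsdp: FsdpSection = field(default_factory=FsdpSection)

@dataclass
class RegimeSection:
    num_steps: int = 5000
    validate_every_n_steps: int = 100
    checkpoint_every_n_steps: int = 1000
    keep_last_n_checkpoints: int = 1
    publish_metrics_every_n_steps: int = 10

@dataclass
class InstructionFinetuneConfig:
    """Top-level recipe config: one nested dataclass per YAML section."""
    dataset: DatasetSection = field(default_factory=DatasetSection)
    trainer: TrainerSection = field(default_factory=TrainerSection)
    regime: RegimeSection = field(default_factory=RegimeSection)
```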
fairseq2 provides preset configurations for common scenarios:
```yaml
# Available presets for instruction fine-tuning:
- llama3_1_instruct               # Base LLaMA 3.1 8B
- llama3_1_instruct_constant_lr   # With constant learning rate
- llama3_1_instruct_lr_anneal_0   # With LR annealing to 0
- llama3_1_70b_instruct           # LLaMA 3.1 70B
- llama2_7b_chat                  # LLaMA 2 7B
- llama2_70b_chat                 # LLaMA 2 70B
```
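Conceptually, each preset name maps to a factory that returns a fully populated config, and variant presets start from a base config and override a handful of fields. The sketch below reuses the `InstructionFinetuneConfig` dataclasses from the previous block; the preset names, registration helper, and overridden values are made up for illustration and do not reflect fairseq2's actual registration API.

```python
from typing import Callable, Dict

# Assumes InstructionFinetuneConfig from the dataclass sketch above.
PRESETS: Dict[str, Callable[[], "InstructionFinetuneConfig"]] = {}

def register_preset(name: str):
    """Register a config factory under a preset name (illustrative helper)."""
    def wrapper(factory):
        PRESETS[name] = factory
        return factory
    return wrapper

@register_preset("example_base")
def _base_config() -> "InstructionFinetuneConfig":
    return InstructionFinetuneConfig()

@register_preset("example_short_run")
def _short_run_config() -> "InstructionFinetuneConfig":
    # Variant preset: tweak a few fields on top of the base (values made up).
    config = _base_config()
    config.regime.num_steps = 1000
    config.regime.checkpoint_every_n_steps = 500
    return config

config = PRESETS["example_short_run"]()
```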