HuggingFace Model SFT Training with fairseq2¶
fairseq2 provides integration with HuggingFace Transformers models through the
fairseq2.models.hg module. This guide walks through a complete
supervised fine-tuning (SFT) example using a Gemma model loaded via HuggingFace
and trained with fairseq2’s recipe system and distributed training
infrastructure.
Prerequisites¶
Ensure fairseq2 is installed and the virtual environment is activated:
source .venv/bin/activate
The example uses google/gemma-3-1b-it from HuggingFace Hub and the
facebook/fairseq2-lm-gsm8k dataset. Both are downloaded automatically on
first use.
Running the Recipe¶
fairseq2 uses a YAML-based recipe system for training. A pre-configured
example for Gemma SFT on GSM8K is provided at
recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml.
Single GPU (recommended):
python -m recipes.lm.sft \
--config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
/path/to/output
Multi-GPU with FSDP (requires 2+ GPUs):
torchrun --nproc_per_node=2 -m recipes.lm.sft \
--config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
/path/to/output
Override config values from the command line:
python -m recipes.lm.sft \
--config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
--config regime.num_steps=100 dataset.max_seq_len=2048 \
/path/to/output
Dump the full default config (useful for reference):
python -m recipes.lm.sft --dump-config
Understanding the Configuration¶
The complete Gemma SFT config is shown below. Each section is explained afterwards.
model:
name: null
family: "hg"
arch: "causal_lm"
config_overrides:
hf_name: "google/gemma-3-1b-it"
model_type: "causal_lm"
trust_remote_code: true
tokenizer:
path: "google/gemma-3-1b-it"
family: "hg"
dataset:
max_seq_len: 4096
max_num_tokens: 8192
chat_mode: false
config_overrides:
sources:
train:
- path: "hg://facebook/fairseq2-lm-gsm8k"
split: "sft_train"
weight: 1.0
common:
metric_recorders:
wandb:
enabled: False
project: "fairseq2"
run_name: "sft_gemma_3_1b_it_gsm8k"
regime:
num_steps: 500
checkpoint_every_n_steps: 500
validate_every_n_steps: 10000
checkpoint_every_n_data_epochs: 100
keep_last_n_checkpoints: 1
publish_metrics_every_n_steps: 10
save_model_only: true
export_hugging_face: false
Model Section¶
model:
name: null
family: "hg"
arch: "causal_lm"
config_overrides:
hf_name: "google/gemma-3-1b-it"
model_type: "causal_lm"
trust_remote_code: true
name: nulldisables the default fairseq2 model lookup so the model is loaded entirely through the HuggingFace integration.family: "hg"selects the HuggingFace model family, which usesload_causal_lm()internally.arch: "causal_lm"tells the factory to useAutoModelForCausalLM.config_overridespasses fields toHuggingFaceModelConfig:hf_nameis the HuggingFace Hub model identifier.model_type: "causal_lm"ensures the model is wrapped inHgCausalLMAdapter, which adapts the HuggingFace model to fairseq2’sCausalLMinterface.
Compare this with a native fairseq2 model config (e.g. LLaMA) which only needs
name:
model:
name: "llama3_2_1b"
Tokenizer Section¶
tokenizer:
path: "google/gemma-3-1b-it"
family: "hg"
pathspecifies the HuggingFace tokenizer to load (same identifier as the model). This usesAutoTokenizer.from_pretrainedunder the hood viaload_hg_tokenizer_simple().family: "hg"selects the HuggingFace tokenizer family.
Dataset Section¶
dataset:
max_seq_len: 4096
max_num_tokens: 8192
chat_mode: false
config_overrides:
sources:
train:
- path: "hg://facebook/fairseq2-lm-gsm8k"
split: "sft_train"
weight: 1.0
max_seq_len: 4096drops sequences longer than 4096 tokens.max_num_tokens: 8192enables dynamic batching — each batch contains at most 8192 tokens total, automatically adjusting the number of sequences per batch based on their lengths.chat_mode: falseuses standard SFT behavior where all tokens after the source are treated as training targets. HuggingFace tokenizers do not generate theassistant_maskrequired by fairseq2’s chat mode.sourcesdefines the training data. Thehg://prefix loads datasets from HuggingFace Hub. Multiple sources with different weights can be specified for weighted sampling.
The dataset expects JSONL entries with src and tgt fields:
{"src": "What is 2 + 2?", "tgt": "2 + 2 = 4. The answer is 4."}
Regime Section¶
regime:
num_steps: 500
checkpoint_every_n_steps: 500
validate_every_n_steps: 10000
keep_last_n_checkpoints: 1
publish_metrics_every_n_steps: 10
save_model_only: true
export_hugging_face: false
num_steps: 500trains for 500 optimizer steps (roughly 5 epochs over the GSM8K dataset on a single GPU).save_model_only: truesaves only the model weights, not the optimizer state. This produces smaller checkpoints suitable for inference.export_hugging_face: falseshould be disabled for HuggingFace models since they are already in HuggingFace format.
Key Differences from Native fairseq2 Models¶
When using HuggingFace models with the recipe system, there are a few differences compared to native fairseq2 model families like LLaMA:
Model loading: Set
family: "hg"and provideconfig_overrideswith the HuggingFace model identifier instead of using a fairseq2name.Tokenizer: Set
family: "hg"and usepath(notname) to point to the HuggingFace tokenizer.Chat mode: Disable
chat_modeunless your HuggingFace tokenizer generatesassistant_maskfields. Standard SFT (all target tokens contribute to loss) works out of the box.HuggingFace export: Set
export_hugging_face: false— the model is already in HuggingFace format.
For comparison, a native LLaMA config is much shorter because the model and tokenizer are registered in fairseq2’s asset system:
model:
name: "llama3_2_1b"
tokenizer:
name: "llama3_2_1b"
dataset:
max_seq_len: 4096
max_num_tokens: 8192
chat_mode: true
config_overrides:
sources:
train:
- path: "hg://facebook/fairseq2-lm-gsm8k"
split: "sft_train"
weight: 1.0
Adapting to Other HuggingFace Models¶
To fine-tune a different HuggingFace model, copy the Gemma config and change the model identifier:
model:
name: null
family: "hg"
arch: "causal_lm"
config_overrides:
hf_name: "mistralai/Mistral-7B-Instruct-v0.3"
model_type: "causal_lm"
tokenizer:
path: "mistralai/Mistral-7B-Instruct-v0.3"
family: "hg"
For seq2seq models (e.g. T5), change arch and model_type:
model:
name: null
family: "hg"
arch: "seq2seq_lm"
config_overrides:
hf_name: "google-t5/t5-small"
model_type: "seq2seq_lm"
Checkpointing and Resumption¶
The recipe uses fairseq2’s
StandardCheckpointManager for robust
checkpointing:
Automatic resume: Re-running the same command with the same output directory automatically resumes from the last checkpoint.
Distributed-safe: Works correctly in multi-GPU setups.
Checkpoints saved to:
{output_dir}/checkpoints/step_{N}/
# First run (trains and saves checkpoint)
python -m recipes.lm.sft \
--config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
/path/to/output
# Resume from checkpoint (automatically detects and loads)
python -m recipes.lm.sft \
--config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
/path/to/output
Multi-GPU Notes¶
When running with FSDP (the default for multi-GPU):
Each rank sees its local device (e.g.,
cuda:0from rank 0’s perspective). The model is actually sharded across all GPUs — this is expected FSDP behavior.Only rank 0 prints output to avoid clutter.
Adjust
max_num_tokensif you run out of memory on multi-GPU setups.
Note
For production training, consider tuning the optimizer and learning rate
scheduler. The recipe system exposes optimizer and lr_scheduler
config sections — run python -m recipes.lm.sft --dump-config to see
all available options.