HuggingFace Model SFT Training with fairseq2

fairseq2 provides integration with HuggingFace Transformers models through the fairseq2.models.hg module. This guide walks through a complete supervised fine-tuning (SFT) example using a Gemma model loaded via HuggingFace and trained with fairseq2’s recipe system and distributed training infrastructure.

Prerequisites

Ensure fairseq2 is installed and the virtual environment is activated:

source .venv/bin/activate

The example uses google/gemma-3-1b-it from HuggingFace Hub and the facebook/fairseq2-lm-gsm8k dataset. Both are downloaded automatically on first use.

Running the Recipe

fairseq2 uses a YAML-based recipe system for training. A pre-configured example for Gemma SFT on GSM8K is provided at recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml.

Single GPU (recommended):

python -m recipes.lm.sft \
    --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
    /path/to/output

Multi-GPU with FSDP (requires 2+ GPUs):

torchrun --nproc_per_node=2 -m recipes.lm.sft \
    --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
    /path/to/output

Override config values from the command line:

python -m recipes.lm.sft \
    --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
    --config regime.num_steps=100 dataset.max_seq_len=2048 \
    /path/to/output

Dump the full default config (useful for reference):

python -m recipes.lm.sft --dump-config

Understanding the Configuration

The complete Gemma SFT config is shown below. Each section is explained afterwards.

model:
  name: null
  family: "hg"
  arch: "causal_lm"
  config_overrides:
    hf_name: "google/gemma-3-1b-it"
    model_type: "causal_lm"
    trust_remote_code: true

tokenizer:
  path: "google/gemma-3-1b-it"
  family: "hg"

dataset:
  max_seq_len: 4096
  max_num_tokens: 8192
  chat_mode: false
  config_overrides:
    sources:
      train:
      - path: "hg://facebook/fairseq2-lm-gsm8k"
        split: "sft_train"
        weight: 1.0

common:
  metric_recorders:
    wandb:
      enabled: false
      project: "fairseq2"
      run_name: "sft_gemma_3_1b_it_gsm8k"

regime:
  num_steps: 500
  checkpoint_every_n_steps: 500
  validate_every_n_steps: 10000
  checkpoint_every_n_data_epochs: 100
  keep_last_n_checkpoints: 1
  publish_metrics_every_n_steps: 10
  save_model_only: true
  export_hugging_face: false

Model Section

model:
  name: null
  family: "hg"
  arch: "causal_lm"
  config_overrides:
    hf_name: "google/gemma-3-1b-it"
    model_type: "causal_lm"
    trust_remote_code: true

  • name: null disables the default fairseq2 model lookup so the model is loaded entirely through the HuggingFace integration.

  • family: "hg" selects the HuggingFace model family, which uses load_causal_lm() internally.

  • arch: "causal_lm" tells the factory to use AutoModelForCausalLM.

  • config_overrides passes fields to HuggingFaceModelConfig:

    • hf_name is the HuggingFace Hub model identifier.

    • model_type: "causal_lm" ensures the model is wrapped in HgCausalLMAdapter, which adapts the HuggingFace model to fairseq2’s CausalLM interface.

Compare this with a native fairseq2 model config (e.g. LLaMA) which only needs name:

model:
  name: "llama3_2_1b"

Tokenizer Section

tokenizer:
  path: "google/gemma-3-1b-it"
  family: "hg"

  • path specifies the HuggingFace tokenizer to load (same identifier as the model). This uses AutoTokenizer.from_pretrained under the hood via load_hg_tokenizer_simple().

  • family: "hg" selects the HuggingFace tokenizer family.

Dataset Section

dataset:
  max_seq_len: 4096
  max_num_tokens: 8192
  chat_mode: false
  config_overrides:
    sources:
      train:
      - path: "hg://facebook/fairseq2-lm-gsm8k"
        split: "sft_train"
        weight: 1.0

  • max_seq_len: 4096 drops sequences longer than 4096 tokens.

  • max_num_tokens: 8192 enables dynamic batching — each batch contains at most 8192 tokens total, automatically adjusting the number of sequences per batch based on their lengths.

  • chat_mode: false uses standard SFT behavior where all tokens after the source are treated as training targets. HuggingFace tokenizers do not generate the assistant_mask required by fairseq2’s chat mode.

  • sources defines the training data. The hg:// prefix loads datasets from HuggingFace Hub. Multiple sources with different weights can be specified for weighted sampling.
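The dynamic batching described above can be sketched in a few lines. This is an illustration of the idea, not fairseq2's actual batching code; the function name and the greedy strategy are assumptions made for clarity:

```python
def batch_by_tokens(lengths, max_num_tokens):
    """Greedily group sequence indices so each batch's total token count
    stays at or below max_num_tokens (illustrative sketch only)."""
    batches, current, current_tokens = [], [], 0
    for i, n in enumerate(lengths):
        # Start a new batch when adding this sequence would exceed the budget.
        if current and current_tokens + n > max_num_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```

With max_num_tokens: 8192, two 3000-token sequences fit in one batch, while a single long sequence gets a batch to itself — fewer sequences per batch as lengths grow.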

The dataset expects JSONL entries with src and tgt fields:

{"src": "What is 2 + 2?", "tgt": "2 + 2 = 4. The answer is 4."}
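A small script can generate and sanity-check data in this format before training. The snippet below is a standalone illustration using only the standard library (it is not a fairseq2 API):

```python
import json
import tempfile
from pathlib import Path

# Example records in the src/tgt JSONL format the dataset expects.
examples = [
    {"src": "What is 2 + 2?", "tgt": "2 + 2 = 4. The answer is 4."},
    {"src": "What is 3 * 3?", "tgt": "3 * 3 = 9. The answer is 9."},
]

path = Path(tempfile.mkdtemp()) / "train.jsonl"
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Validate: every line is a JSON object with non-empty "src" and "tgt" fields.
records = [json.loads(line) for line in open(path)]
assert all(r["src"] and r["tgt"] for r in records)
```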

Regime Section

regime:
  num_steps: 500
  checkpoint_every_n_steps: 500
  validate_every_n_steps: 10000
  keep_last_n_checkpoints: 1
  publish_metrics_every_n_steps: 10
  save_model_only: true
  export_hugging_face: false

  • num_steps: 500 trains for 500 optimizer steps (roughly 5 epochs over the GSM8K dataset on a single GPU).

  • save_model_only: true saves only the model weights, not the optimizer state. This produces smaller checkpoints suitable for inference.

  • export_hugging_face: false should be disabled for HuggingFace models since they are already in HuggingFace format.

Key Differences from Native fairseq2 Models

When using HuggingFace models with the recipe system, there are a few differences compared to native fairseq2 model families like LLaMA:

  1. Model loading: Set family: "hg" and provide config_overrides with the HuggingFace model identifier instead of using a fairseq2 name.

  2. Tokenizer: Set family: "hg" and use path (not name) to point to the HuggingFace tokenizer.

  3. Chat mode: Disable chat_mode unless your HuggingFace tokenizer generates assistant_mask fields. Standard SFT (all target tokens contribute to loss) works out of the box.

  4. HuggingFace export: Set export_hugging_face: false — the model is already in HuggingFace format.

For comparison, a native LLaMA config is much shorter because the model and tokenizer are registered in fairseq2’s asset system:

model:
  name: "llama3_2_1b"
tokenizer:
  name: "llama3_2_1b"
dataset:
  max_seq_len: 4096
  max_num_tokens: 8192
  chat_mode: true
  config_overrides:
    sources:
      train:
      - path: "hg://facebook/fairseq2-lm-gsm8k"
        split: "sft_train"
        weight: 1.0

Adapting to Other HuggingFace Models

To fine-tune a different HuggingFace model, copy the Gemma config and change the model identifier:

model:
  name: null
  family: "hg"
  arch: "causal_lm"
  config_overrides:
    hf_name: "mistralai/Mistral-7B-Instruct-v0.3"
    model_type: "causal_lm"

tokenizer:
  path: "mistralai/Mistral-7B-Instruct-v0.3"
  family: "hg"

For seq2seq models (e.g. T5), change arch and model_type:

model:
  name: null
  family: "hg"
  arch: "seq2seq_lm"
  config_overrides:
    hf_name: "google-t5/t5-small"
    model_type: "seq2seq_lm"

Checkpointing and Resumption

The recipe uses fairseq2’s StandardCheckpointManager for robust checkpointing:

  • Automatic resume: Re-running the same command with the same output directory automatically resumes from the last checkpoint.

  • Distributed-safe: Works correctly in multi-GPU setups.

  • Checkpoints saved to: {output_dir}/checkpoints/step_{N}/

Example - Resume Training

# First run (trains and saves checkpoint)
python -m recipes.lm.sft \
    --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
    /path/to/output

# Resume from checkpoint (automatically detects and loads)
python -m recipes.lm.sft \
    --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
    /path/to/output

Multi-GPU Notes

When running with FSDP (the default for multi-GPU):

  • Each rank sees its local device (e.g., cuda:0 from rank 0’s perspective). The model is actually sharded across all GPUs — this is expected FSDP behavior.

  • Only rank 0 prints output to avoid clutter.

  • Adjust max_num_tokens if you run out of memory on multi-GPU setups.
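The rank-0-only printing relies on the environment variables torchrun exports to every worker (RANK, LOCAL_RANK, WORLD_SIZE). A minimal guard looks like this — a generic pattern for torchrun-launched scripts, not fairseq2-specific code:

```python
import os

def is_rank_zero() -> bool:
    # torchrun sets RANK per process; a plain `python` launch has no RANK,
    # so default to 0 for single-process runs.
    return int(os.environ.get("RANK", "0")) == 0

def log(msg: str) -> None:
    # Keep multi-GPU logs readable by printing from rank 0 only.
    if is_rank_zero():
        print(msg)
```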

Note

For production training, consider tuning the optimizer and learning rate scheduler. The recipe system exposes optimizer and lr_scheduler config sections — run python -m recipes.lm.sft --dump-config to see all available options.
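As a starting point, such overrides could look like the fragment below. The field names here are assumptions for illustration — confirm the actual keys and defaults against your --dump-config output before using them:

```yaml
# Hypothetical field names -- verify against `--dump-config` output.
optimizer:
  config:
    lr: 1.0e-05
    weight_decay: 0.1
lr_scheduler:
  config:
    num_warmup_steps: 50
```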