==================================================
HuggingFace Model SFT Training with fairseq2
==================================================

.. currentmodule:: fairseq2.models.hg

fairseq2 provides integration with HuggingFace Transformers models through the
:mod:`fairseq2.models.hg` module. This guide walks through a complete
supervised fine-tuning (SFT) example using a Gemma model loaded via HuggingFace
and trained with fairseq2's recipe system and distributed training
infrastructure.

Prerequisites
=============

Ensure fairseq2 is installed and the virtual environment is activated:

.. code:: bash

    source .venv/bin/activate

The example uses ``google/gemma-3-1b-it`` from HuggingFace Hub and the
``facebook/fairseq2-lm-gsm8k`` dataset. Both are downloaded automatically on
first use.

Running the Recipe
==================

fairseq2 uses a YAML-based recipe system for training. A pre-configured
example for Gemma SFT on GSM8K is provided at
``recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml``.

**Single GPU** (recommended):

.. code:: bash

    python -m recipes.lm.sft \
        --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
        /path/to/output

**Multi-GPU with FSDP** (requires 2+ GPUs):

.. code:: bash

    torchrun --nproc_per_node=2 -m recipes.lm.sft \
        --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
        /path/to/output

**Override config values from the command line**:

.. code:: bash

    python -m recipes.lm.sft \
        --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
        --config regime.num_steps=100 dataset.max_seq_len=2048 \
        /path/to/output

**Dump the full default config** (useful for reference):

.. code:: bash

    python -m recipes.lm.sft --dump-config

Understanding the Configuration
================================

The complete Gemma SFT config is shown below. Each section is explained
afterwards.

.. code:: yaml

    model:
      name: null
      family: "hg"
      arch: "causal_lm"
      config_overrides:
        hf_name: "google/gemma-3-1b-it"
        model_type: "causal_lm"
        trust_remote_code: true

    tokenizer:
      path: "google/gemma-3-1b-it"
      family: "hg"

    dataset:
      max_seq_len: 4096
      max_num_tokens: 8192
      chat_mode: false
      config_overrides:
        sources:
          train:
          - path: "hg://facebook/fairseq2-lm-gsm8k"
            split: "sft_train"
            weight: 1.0

    common:
      metric_recorders:
        wandb:
          enabled: False
          project: "fairseq2"
          run_name: "sft_gemma_3_1b_it_gsm8k"

    regime:
      num_steps: 500
      checkpoint_every_n_steps: 500
      validate_every_n_steps: 10000
      checkpoint_every_n_data_epochs: 100
      keep_last_n_checkpoints: 1
      publish_metrics_every_n_steps: 10
      save_model_only: true
      export_hugging_face: false

Model Section
-------------

.. code:: yaml

    model:
      name: null
      family: "hg"
      arch: "causal_lm"
      config_overrides:
        hf_name: "google/gemma-3-1b-it"
        model_type: "causal_lm"
        trust_remote_code: true

- ``name: null`` disables the default fairseq2 model lookup so the model is
  loaded entirely through the HuggingFace integration.
- ``family: "hg"`` selects the HuggingFace model family, which uses
  :func:`load_causal_lm` internally.
- ``arch: "causal_lm"`` tells the factory to use ``AutoModelForCausalLM``.
- ``config_overrides`` passes fields to
  :class:`~fairseq2.models.hg.config.HuggingFaceModelConfig`:

  - ``hf_name`` is the HuggingFace Hub model identifier.
  - ``model_type: "causal_lm"`` ensures the model is wrapped in
    :class:`~fairseq2.models.hg.adapter.HgCausalLMAdapter`, which
    adapts the HuggingFace model to fairseq2's
    :class:`~fairseq2.models.clm.CausalLM` interface.

Compare this with a native fairseq2 model config (e.g. LLaMA) which only needs
``name``:

.. code:: yaml

    model:
      name: "llama3_2_1b"

Tokenizer Section
-----------------

.. code:: yaml

    tokenizer:
      path: "google/gemma-3-1b-it"
      family: "hg"

- ``path`` specifies the HuggingFace tokenizer to load (same identifier as the
  model). This uses ``AutoTokenizer.from_pretrained`` under the hood via
  :func:`load_hg_tokenizer_simple`.
- ``family: "hg"`` selects the HuggingFace tokenizer family.

Dataset Section
---------------

.. code:: yaml

    dataset:
      max_seq_len: 4096
      max_num_tokens: 8192
      chat_mode: false
      config_overrides:
        sources:
          train:
          - path: "hg://facebook/fairseq2-lm-gsm8k"
            split: "sft_train"
            weight: 1.0

- ``max_seq_len: 4096`` drops sequences longer than 4096 tokens.
- ``max_num_tokens: 8192`` enables dynamic batching — each batch contains at
  most 8192 tokens total, automatically adjusting the number of sequences per
  batch based on their lengths.
- ``chat_mode: false`` uses standard SFT behavior where all tokens after the
  source are treated as training targets. HuggingFace tokenizers do not generate
  the ``assistant_mask`` required by fairseq2's chat mode.
- ``sources`` defines the training data. The ``hg://`` prefix loads datasets
  from HuggingFace Hub. Multiple sources with different weights can be specified
  for weighted sampling.

The dataset expects JSONL entries with ``src`` and ``tgt`` fields:

.. code:: json

    {"src": "What is 2 + 2?", "tgt": "2 + 2 = 4. The answer is 4."}

Regime Section
--------------

.. code:: yaml

    regime:
      num_steps: 500
      checkpoint_every_n_steps: 500
      validate_every_n_steps: 10000
      keep_last_n_checkpoints: 1
      publish_metrics_every_n_steps: 10
      save_model_only: true
      export_hugging_face: false

- ``num_steps: 500`` trains for 500 optimizer steps (roughly 5 epochs over the
  GSM8K dataset on a single GPU).
- ``save_model_only: true`` saves only the model weights, not the optimizer
  state. This produces smaller checkpoints suitable for inference.
- ``export_hugging_face: false`` should be disabled for HuggingFace models since
  they are already in HuggingFace format.

Key Differences from Native fairseq2 Models
============================================

When using HuggingFace models with the recipe system, there are a few
differences compared to native fairseq2 model families like LLaMA:

1. **Model loading**: Set ``family: "hg"`` and provide ``config_overrides``
   with the HuggingFace model identifier instead of using a fairseq2
   ``name``.

2. **Tokenizer**: Set ``family: "hg"`` and use ``path`` (not ``name``) to
   point to the HuggingFace tokenizer.

3. **Chat mode**: Disable ``chat_mode`` unless your HuggingFace tokenizer
   generates ``assistant_mask`` fields. Standard SFT (all target tokens
   contribute to loss) works out of the box.

4. **HuggingFace export**: Set ``export_hugging_face: false`` — the model is
   already in HuggingFace format.

For comparison, a native LLaMA config is much shorter because the model and
tokenizer are registered in fairseq2's asset system:

.. code:: yaml

    model:
      name: "llama3_2_1b"
    tokenizer:
      name: "llama3_2_1b"
    dataset:
      max_seq_len: 4096
      max_num_tokens: 8192
      chat_mode: true
      config_overrides:
        sources:
          train:
          - path: "hg://facebook/fairseq2-lm-gsm8k"
            split: "sft_train"
            weight: 1.0

Adapting to Other HuggingFace Models
=====================================

To fine-tune a different HuggingFace model, copy the Gemma config and change
the model identifier:

.. code:: yaml

    model:
      name: null
      family: "hg"
      arch: "causal_lm"
      config_overrides:
        hf_name: "mistralai/Mistral-7B-Instruct-v0.3"
        model_type: "causal_lm"

    tokenizer:
      path: "mistralai/Mistral-7B-Instruct-v0.3"
      family: "hg"

For seq2seq models (e.g. T5), change ``arch`` and ``model_type``:

.. code:: yaml

    model:
      name: null
      family: "hg"
      arch: "seq2seq_lm"
      config_overrides:
        hf_name: "google-t5/t5-small"
        model_type: "seq2seq_lm"

Checkpointing and Resumption
=============================

The recipe uses fairseq2's
:class:`~fairseq2.checkpoint.StandardCheckpointManager` for robust
checkpointing:

- **Automatic resume**: Re-running the same command with the same output
  directory automatically resumes from the last checkpoint.
- **Distributed-safe**: Works correctly in multi-GPU setups.
- **Checkpoints saved to**: ``{output_dir}/checkpoints/step_{N}/``

.. code:: bash
    :caption: Example - Resume Training

    # First run (trains and saves checkpoint)
    python -m recipes.lm.sft \
        --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
        /path/to/output

    # Resume from checkpoint (automatically detects and loads)
    python -m recipes.lm.sft \
        --config-file recipes/lm/sft/configs/gemma_3_1b_it_gsm8k.yaml \
        /path/to/output

Multi-GPU Notes
===============

When running with FSDP (the default for multi-GPU):

- Each rank sees its local device (e.g., ``cuda:0`` from rank 0's perspective).
  The model is actually sharded across all GPUs — this is expected FSDP
  behavior.
- Only rank 0 prints output to avoid clutter.
- Adjust ``max_num_tokens`` if you run out of memory on multi-GPU setups.

.. note::

    For production training, consider tuning the optimizer and learning rate
    scheduler. The recipe system exposes ``optimizer`` and ``lr_scheduler``
    config sections — run ``python -m recipes.lm.sft --dump-config`` to see
    all available options.