LLM Fine-tuning

LLM fine-tuning with SPDL data loading pipeline.

Fine-tunes LLaMA 3.2 1B on Alpaca-style instruction data using LoRA, with SPDL PipelineBuilder for high-performance concurrent tokenization.

SPDL Data Pipeline

The core of this example is the SPDL data loading pipeline:

  1. DistributedRandomSampler – distributes sample indices across ranks with per-epoch reshuffling

  2. pipe(tokenize, concurrency=N) – concurrent Alpaca-format prompt formatting and tokenization

  3. aggregate(batch_size) – groups into batches

  4. pipe(collate) – stacks tensors

  5. add_sink(buffer_size=3) – prefetch buffer for the training loop
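Ignoring concurrency and buffering, the data flow through these stages can be mimicked with plain generator composition. The following is a stdlib-only sketch for intuition — the names `sample_indices`, `aggregate`, and `run` are illustrative stand-ins, not the SPDL API:

```python
import random
from itertools import islice


def sample_indices(n: int, rank: int, world_size: int, seed: int):
    """Stage 1: shuffled indices for this rank (mimics DistributedRandomSampler)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    yield from idx[rank::world_size]


def aggregate(iterable, batch_size: int):
    """Stage 3: group consecutive items into fixed-size batches."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def run(samples, rank=0, world_size=2, batch_size=2, seed=0):
    # Stage 2 stand-in: "tokenize" each rank-local sample (here: uppercase it).
    tokenized = (samples[i].upper()
                 for i in sample_indices(len(samples), rank, world_size, seed))
    # Stage 4 stand-in: "collate" each batch into a dict.
    return [{"batch": b} for b in aggregate(tokenized, batch_size)]


data = [f"sample-{i}" for i in range(10)]
batches = run(data)
print(batches)
```

In the real pipeline, `pipe(tokenize, concurrency=N)` runs the tokenize stage on N worker threads and `add_sink(buffer_size=3)` keeps a prefetch buffer ahead of the training loop; this sketch only shows how the stages chain.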

Data

Download instruction-following datasets:

# https://github.com/tatsu-lab/stanford_alpaca
python download_alpaca.py --output /tmp/alpaca.jsonl
# https://huggingface.co/datasets/databricks/databricks-dolly-15k
python download_dolly.py --output /tmp/dolly.jsonl

Data format (JSONL with Alpaca-style fields):

{"instruction": "Explain what a linked list is.", "input": "", "output": "A linked list is..."}
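The example's `format_prompt` helper (in its `utils` module) renders these fields into a single training string. A typical Alpaca-style template looks roughly like the sketch below — the exact wording used by `utils.format_prompt` may differ:

```python
def format_prompt(sample: dict[str, str]) -> str:
    """Render an Alpaca-style record as one prompt+response string (illustrative)."""
    if sample.get("input"):
        # Record carries extra context: use the instruction+input template.
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{sample['instruction']}\n\n"
            f"### Input:\n{sample['input']}\n\n"
            f"### Response:\n{sample['output']}"
        )
    # No input field: use the shorter instruction-only template.
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{sample['instruction']}\n\n"
        f"### Response:\n{sample['output']}"
    )


record = {"instruction": "Explain what a linked list is.", "input": "",
          "output": "A linked list is..."}
print(format_prompt(record))
```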

Usage

torchrun \
  --nproc_per_node 8 \
  -m spdl.examples.llm_finetune.llm_finetuning \
  --model-path /path/to/Llama-3.2-1B-Instruct \
  --data-path \
    /tmp/alpaca.jsonl \
    /tmp/dolly.jsonl

With the default settings (global batch size 8x32), the training throughput reaches roughly 570 samples/s on H100 GPUs.

Source

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""LLM fine-tuning with SPDL data loading pipeline.

Fine-tunes LLaMA 3.2 1B on Alpaca-style instruction data using LoRA,
with SPDL PipelineBuilder for high-performance concurrent tokenization.

SPDL Data Pipeline
^^^^^^^^^^^^^^^^^^

The core of this example is the SPDL data loading pipeline:

1. ``DistributedRandomSampler`` -- distributes sample indices across ranks
   with per-epoch reshuffling
2. ``pipe(tokenize, concurrency=N)`` -- concurrent Alpaca-format prompt
   formatting and tokenization
3. ``aggregate(batch_size)`` -- groups into batches
4. ``pipe(collate)`` -- stacks tensors
5. ``add_sink(buffer_size=3)`` -- prefetch buffer for the training loop

Data
^^^^

Download instruction-following datasets::

    # https://github.com/tatsu-lab/stanford_alpaca
    python download_alpaca.py --output /tmp/alpaca.jsonl
    # https://huggingface.co/datasets/databricks/databricks-dolly-15k
    python download_dolly.py --output /tmp/dolly.jsonl

Data format (JSONL with Alpaca-style fields)::

    {"instruction": "Explain what a linked list is.", "input": "", "output": "A linked list is..."}

Usage
^^^^^

::

    torchrun \\
      --nproc_per_node 8 \\
      -m spdl.examples.llm_finetune.llm_finetuning \\
      --model-path /path/to/Llama-3.2-1B-Instruct \\
      --data-path \\
        /tmp/alpaca.jsonl \\
        /tmp/dolly.jsonl

With the default settings (global batch size 8x32), the training throughput
reaches roughly 570 samples/s on H100 GPUs.
"""

from __future__ import annotations

__all__ = [
    "build_model",
    "build_pipeline",
    "iterate_pipeline",
    "load_data",
    "main",
    "train",
]

# pyre-strict

import argparse
import logging
import os
import threading
import time
from collections.abc import Callable, Iterator
from datetime import timedelta
from pathlib import Path
from typing import TYPE_CHECKING

import torch
import torch.distributed as dist
from spdl.pipeline import PipelineBuilder
from spdl.source import DistributedRandomSampler
from torch.nn.parallel import DistributedDataParallel as DDP

if TYPE_CHECKING:
    from transformers import PreTrainedTokenizerBase

try:
    from examples.llm_finetune.utils.utils import (  # pyre-ignore[21]
        format_prompt,
        load_data,
        report_progress,
        resolve_model_path,
    )
except ImportError:
    from spdl.examples.llm_finetune.utils.utils import (
        format_prompt,
        load_data,
        report_progress,
        resolve_model_path,
    )

_LG: logging.Logger = logging.getLogger(__name__)


# ---------------------------------------------------------------------------
# SPDL data pipeline
# ---------------------------------------------------------------------------


def build_pipeline(
    samples: list[dict[str, str]],
    tokenizer: PreTrainedTokenizerBase,
    max_seq_len: int,
    batch_size: int,
    rank: int,
    world_size: int,
    num_threads: int,
    seed: int,
) -> PipelineBuilder:
    """Build an SPDL pipeline for concurrent tokenization.

    Pipeline stages:
      1. Source: DistributedRandomSampler yields sample indices
      2. Pipe: Look up sample by index
      3. Pipe (concurrent): Format prompt and tokenize
      4. Aggregate: Group into batches
      5. Pipe: Collate into tensors
      6. Sink: Buffer for the training loop
    """

    sampler = DistributedRandomSampler(
        len(samples),
        rank=rank,
        world_size=world_size,
        seed=seed,
    )

    # The HuggingFace fast tokenizer's Rust backend is not thread-safe.
    # Use thread-local copies so each SPDL worker thread has its own instance.
    class _TokenizerTLS(threading.local):
        tokenizer: PreTrainedTokenizerBase | None = None

    _tls: _TokenizerTLS = _TokenizerTLS()

    def _get_tokenizer() -> PreTrainedTokenizerBase:
        if _tls.tokenizer is None:
            import copy

            _tls.tokenizer = copy.deepcopy(tokenizer)
        return _tls.tokenizer

    def lookup(idx: int) -> dict[str, str]:
        return samples[idx]

    def tokenize(sample: dict[str, str]) -> dict[str, torch.Tensor]:
        tok = _get_tokenizer()
        text = format_prompt(sample)
        enc = tok(
            text,
            max_length=max_seq_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        # For causal LM, labels = input_ids; mask padding with -100
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }

    def collate(items: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
        return {
            "input_ids": torch.stack([it["input_ids"] for it in items]),
            "attention_mask": torch.stack([it["attention_mask"] for it in items]),
            "labels": torch.stack([it["labels"] for it in items]),
        }

    return (
        PipelineBuilder()
        .add_source(sampler)
        .pipe(lookup)
        .pipe(tokenize, concurrency=num_threads)
        .aggregate(batch_size)
        .pipe(collate)
        .add_sink(buffer_size=3)
    )


def iterate_pipeline(
    pipeline_builder: PipelineBuilder,
    num_threads: int,
    device: torch.device,
) -> Iterator[dict[str, torch.Tensor]]:
    """Build, run, and iterate over the SPDL pipeline, transferring to device."""
    pipeline = pipeline_builder.build(num_threads=num_threads)
    with pipeline.auto_stop():
        for batch in pipeline.get_iterator(timeout=120):
            yield {k: v.to(device, non_blocking=True) for k, v in batch.items()}


# ---------------------------------------------------------------------------
# Model setup
# ---------------------------------------------------------------------------


def build_model(
    model_path: str,
    device: torch.device,
    lora_r: int,
    lora_alpha: int,
    lora_dropout: float,
) -> torch.nn.Module:
    """Load LLaMA model and apply LoRA."""
    from peft import get_peft_model, LoraConfig, TaskType
    from transformers import AutoModelForCausalLM

    _LG.info("Loading model from %s", model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="sdpa",
    )

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    model = model.to(device)
    return model


# ---------------------------------------------------------------------------
# Training
# ---------------------------------------------------------------------------


def train(
    *,
    model_path: str,
    data_path: list[str],
    output_dir: str,
    max_seq_len: int,
    batch_size: int,
    num_epochs: int,
    lr: float,
    weight_decay: float,
    max_grad_norm: float,
    log_interval: int,
    lora_r: int,
    lora_alpha: int,
    lora_dropout: float,
    num_workers: int,
    progress_fn: Callable[[int, int], None] | None = None,
) -> None:
    """Main training function, called per-rank."""
    rank: int = dist.get_rank()
    world_size: int = dist.get_world_size()
    local_rank: int = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    _LG.info("Rank %d/%d on device %s", rank, world_size, device)

    # --- Data ---
    samples = load_data(data_path)

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # --- Model ---
    model = build_model(
        model_path,
        device,
        lora_r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
    )
    ddp_model = DDP(model, device_ids=[local_rank])

    # --- Optimizer ---
    optimizer = torch.optim.AdamW(
        ddp_model.parameters(),
        lr=lr,
        weight_decay=weight_decay,
        foreach=True,
    )

    num_steps_per_epoch = len(samples) // (batch_size * world_size)
    total_steps = num_steps_per_epoch * num_epochs
    if rank == 0:
        _LG.info(
            "Training: %d samples, %d epochs, %d steps/epoch, %d total steps",
            len(samples),
            num_epochs,
            num_steps_per_epoch,
            total_steps,
        )
        if progress_fn is not None:
            progress_fn(0, total_steps)

    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer,
        T_max=total_steps,
        eta_min=lr * 0.1,
    )

    # --- Training loop ---
    global_step = 0
    ddp_model.train()

    for epoch in range(num_epochs):
        _LG.info("Epoch %d/%d", epoch + 1, num_epochs)

        pipeline_builder = build_pipeline(
            samples=samples,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            batch_size=batch_size,
            rank=rank,
            world_size=world_size,
            num_threads=num_workers,
            seed=epoch,  # different shuffle per epoch
        )

        t0 = time.monotonic()
        epoch_loss = 0.0
        num_batches = 0

        for batch in iterate_pipeline(pipeline_builder, num_workers, device):
            outputs = ddp_model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"],
            )
            loss = outputs.loss

            loss.backward()
            torch.nn.utils.clip_grad_norm_(ddp_model.parameters(), max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

            epoch_loss += loss.item()
            num_batches += 1
            global_step += 1

            if rank == 0:
                if progress_fn is not None:
                    progress_fn(global_step, total_steps)
                if global_step % log_interval == 0:
                    avg_loss = epoch_loss / num_batches
                    elapsed = time.monotonic() - t0
                    _LG.info(
                        "Step %d | loss=%.4f | lr=%.2e | %.1f samples/s",
                        global_step,
                        avg_loss,
                        scheduler.get_last_lr()[0],
                        num_batches * batch_size * world_size / elapsed,
                    )

        elapsed = time.monotonic() - t0
        if rank == 0:
            avg_loss = epoch_loss / max(num_batches, 1)
            _LG.info(
                "Epoch %d complete | avg_loss=%.4f | %.1fs | %.1f samples/s",
                epoch + 1,
                avg_loss,
                elapsed,
                num_batches * batch_size * world_size / elapsed,
            )

    # --- Save ---
    if rank == 0 and output_dir:
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        ddp_model.module.save_pretrained(output_path)
        tokenizer.save_pretrained(output_path)
        _LG.info("Model saved to %s", output_path)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    # Model
    parser.add_argument(
        "--model-path",
        type=str,
        required=True,
        help="Path to pretrained LLaMA model directory",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="",
        help="Directory to save fine-tuned LoRA weights",
    )
    # Data
    parser.add_argument(
        "--data-path",
        type=str,
        nargs="+",
        required=True,
        help="One or more paths to Alpaca-format JSONL files (local or manifold://).",
    )
    parser.add_argument("--max-seq-len", type=int, default=512)
    # Training
    parser.add_argument("--batch-size", type=int, default=4)
    parser.add_argument("--num-epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=5e-4)
    parser.add_argument("--weight-decay", type=float, default=0.01)
    parser.add_argument("--max-grad-norm", type=float, default=1.0)
    parser.add_argument("--log-interval", type=int, default=10)
    # LoRA
    parser.add_argument("--lora-r", type=int, default=8)
    parser.add_argument("--lora-alpha", type=int, default=16)
    parser.add_argument("--lora-dropout", type=float, default=0.05)
    # Pipeline
    parser.add_argument(
        "--num-workers",
        type=int,
        default=8,
        help="Concurrent tokenization workers in the SPDL pipeline",
    )
    return parser.parse_args()


def init_logging() -> None:
    """Initialize logging."""
    rank = os.environ.get("RANK", "?")
    logging.basicConfig(
        level=logging.INFO,
        format=f"%(asctime)s [%(levelname).1s] [Rank{rank}] %(name)s: %(message)s",
    )


def main(args: argparse.Namespace) -> None:
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
    try:
        train(
            model_path=resolve_model_path(args.model_path),
            data_path=args.data_path,
            output_dir=args.output_dir,
            max_seq_len=args.max_seq_len,
            batch_size=args.batch_size,
            num_epochs=args.num_epochs,
            lr=args.lr,
            weight_decay=args.weight_decay,
            max_grad_norm=args.max_grad_norm,
            log_interval=args.log_interval,
            lora_r=args.lora_r,
            lora_alpha=args.lora_alpha,
            lora_dropout=args.lora_dropout,
            num_workers=args.num_workers,
            progress_fn=report_progress,
        )
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    init_logging()
    main(parse_args())

API Reference

Functions

build_model(model_path: str, device: device, lora_r: int, lora_alpha: int, lora_dropout: float) → Module

Load LLaMA model and apply LoRA.

build_pipeline(samples: list[dict[str, str]], tokenizer: PreTrainedTokenizerBase, max_seq_len: int, batch_size: int, rank: int, world_size: int, num_threads: int, seed: int) → PipelineBuilder

Build an SPDL pipeline for concurrent tokenization.

Pipeline stages:
  1. Source: DistributedRandomSampler yields sample indices

  2. Pipe: Look up sample by index

  3. Pipe (concurrent): Format prompt and tokenize

  4. Aggregate: Group into batches

  5. Pipe: Collate into tensors

  6. Sink: Buffer for the training loop
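In the concurrent tokenize stage, causal-LM labels are built as a copy of `input_ids` with padding positions set to -100 so the cross-entropy loss ignores them. The masking rule, sketched here on plain Python lists rather than tensors:

```python
IGNORE_INDEX = -100  # label value skipped by the causal-LM loss


def mask_labels(input_ids: list[int], attention_mask: list[int]) -> list[int]:
    """labels = input_ids where attended (mask == 1), IGNORE_INDEX where padded."""
    return [tok if m == 1 else IGNORE_INDEX
            for tok, m in zip(input_ids, attention_mask)]


# Two real tokens followed by two padding tokens (the pad id 0 is arbitrary).
print(mask_labels([101, 2054, 0, 0], [1, 1, 0, 0]))  # → [101, 2054, -100, -100]
```

The tensor version in the source does the same thing in one shot: `labels[attention_mask == 0] = -100`.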

iterate_pipeline(pipeline_builder: PipelineBuilder, num_threads: int, device: device) → Iterator[dict[str, Tensor]]

Build, run, and iterate over the SPDL pipeline, transferring to device.

load_data(paths: Sequence[str]) → list[dict[str, str]]

Load and concatenate data from one or more JSONL files.
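`load_data` lives in the example's `utils` module. For local files, a minimal stdlib equivalent might look like this (an illustrative sketch, not the actual implementation):

```python
import json
from collections.abc import Sequence


def load_data(paths: Sequence[str]) -> list[dict[str, str]]:
    """Load and concatenate records from one or more JSONL files."""
    records: list[dict[str, str]] = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():  # tolerate blank lines between records
                    records.append(json.loads(line))
    return records
```

The real helper also handles non-local paths (the `--data-path` help text mentions manifold:// URIs), which this sketch omits.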

main(args: Namespace) → None

train(*, model_path: str, data_path: list[str], output_dir: str, max_seq_len: int, batch_size: int, num_epochs: int, lr: float, weight_decay: float, max_grad_norm: float, log_interval: int, lora_r: int, lora_alpha: int, lora_dropout: float, num_workers: int, progress_fn: Callable[[int, int], None] | None = None) → None

Main training function, called per-rank.