Trainer

The fairseq2.recipes.trainer.Trainer class is the core class for training models.

Overview

The trainer in fairseq2 is designed to be flexible and model-agnostic, handling scenarios that range from simple single-device runs to complex distributed training setups. It is one of the most complex systems in fairseq2, but also one of the most powerful.

    flowchart LR
    %% Main Trainer Class
    A[Trainer] --> B[TrainUnit]
    A --> C[DataReader]
    A --> D[Optimizer]
    A --> E[CheckpointManager]
    A --> H[LRScheduler]
    A --> I[Gang System]
    A --> P[Metrics Logging]
    A --> V[Validation]

    %% TrainUnit Components
    B --> F[Model]
    B --> G[MetricBag]

    %% Gang System
    I --> J[Root Gang]
    I --> K[DP Gang]
    I --> L[TP Gang]

    %% Metrics Logging
    P --> P1[TensorBoard]
    P --> P2[WandB]
    P --> P3[JSON Logger]

    %% Validation
    V --> Q[EvalUnit]
    V --> R[Validation DataReader]

    %% CheckpointManager Details
    E --> E1[Save State]
    E --> E2[Load State]
    E --> E3[Keep Best Checkpoints]
    E --> E4[Save FSDP Model]
    

Core Components

TrainUnit

The TrainUnit is an abstract class that encapsulates model-specific training logic:

from abc import ABC, abstractmethod
from typing import Generic, TypeVar

from torch import Tensor
from torch.nn import Module

from fairseq2.metrics import MetricBag

BatchT_contra = TypeVar("BatchT_contra", contravariant=True)


class TrainUnit(ABC, Generic[BatchT_contra]):
    """Represents a unit to be used with Trainer."""

    @abstractmethod
    def __call__(self, batch: BatchT_contra) -> tuple[Tensor, int | None]:
        """Process batch and return loss and number of targets."""

    @abstractmethod
    def set_step_nr(self, step_nr: int) -> None:
        """Set current training step number."""

    @property
    @abstractmethod
    def model(self) -> Module:
        """The underlying model."""

    @property
    @abstractmethod
    def metric_bag(self) -> MetricBag:
        """Training-related metrics."""

Example implementation:

from torcheval.metrics import Mean

class TransformerTrainUnit(TrainUnit[TransformerBatch]):
    def __init__(self, model: TransformerModel) -> None:
        self._model = model
        self._metric_bag = MetricBag()
        self._metric_bag.register_metric("loss", Mean())

    def __call__(self, batch: TransformerBatch) -> tuple[Tensor, int]:
        outputs = self._model(**batch)
        return outputs.loss, batch.num_tokens

    # set_step_nr() and the model/metric_bag properties must also be
    # implemented to satisfy the TrainUnit interface (omitted for brevity).

Trainer Configuration

The fairseq2.recipes.trainer.Trainer class accepts a wide range of configuration options:

# Example Trainer Configuration
trainer = Trainer(
    # Basic parameters
    unit=train_unit,                     # Training unit to compute loss
    data_reader=data_reader,             # Data reader for training batches
    optimizer=optimizer,                 # Optimizer
    checkpoint_manager=checkpoint_mgr,   # Checkpoint manager
    root_gang=root_gang,                 # Root gang for distributed training

    # Optional parameters
    dp_gang=dp_gang,                     # Data parallel gang
    tp_gang=tp_gang,                     # Tensor parallel gang
    dtype=torch.float32,                 # Model data type
    lr_scheduler=lr_scheduler,           # Learning rate scheduler
    max_num_steps=100_000,               # Maximum training steps
    max_num_data_epochs=10,              # Maximum training epochs

    # Validation parameters
    valid_units=[valid_unit],            # Validation units
    valid_data_readers=[valid_reader],   # Validation data readers
    validate_every_n_steps=1_000,        # Validation frequency

    # Checkpoint parameters
    checkpoint_every_n_steps=5_000,      # Checkpoint frequency
    keep_last_n_checkpoints=5,           # Number of checkpoints to keep
    keep_best_n_checkpoints=3,           # Number of best checkpoints to keep
    keep_last_n_models=5,                # Number of models to keep
    keep_best_n_models=3,                # Number of best models to keep

    # Metric parameters
    publish_metrics_every_n_steps=100,   # Metric publishing frequency
    tb_dir=Path("runs"),                 # TensorBoard directory
    metrics_dir=Path("metrics"),         # Metrics directory

    # Advanced parameters
    fp16_loss_scale=(128.0, 0.0001),    # Initial and min loss scale for fp16
    max_gradient_norm=None,              # Max gradient norm for clipping
    amp=False,                           # Enable automatic mixed precision
    anomaly_detection=False,             # Enable autograd anomaly detection
    seed=2                               # Random seed
)
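
Once constructed, the trainer drives the whole loop itself; in the recipe system it is simply invoked as a callable (a minimal sketch, assuming the components above have already been built):

# Run training end to end. Validation, checkpointing, and metric publishing
# all happen inside this call, following the configuration above.
trainer()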

Training Flow

The training process follows this simplified sequence:

    sequenceDiagram
    participant T as Trainer
    participant U as TrainUnit
    participant D as DataReader
    participant M as Model
    participant O as Optimizer

    T->>D: Request batch
    D-->>T: Return batch
    T->>U: Process batch
    U->>M: Forward pass
    M-->>U: Return loss
    U-->>T: Return loss, num_targets
    T->>M: Backward pass
    T->>O: Update parameters
    T->>T: Update metrics
    
Step-by-step breakdown

The following simplified code snippets break the trainer down step by step to help you understand the training flow.

  1. Initialization: The trainer is initialized with the necessary components and configurations.

def __init__(self, unit: TrainUnit[BatchT], data_reader: DataReader[BatchT], ...):
    self._model = unit.model
    self._unit = unit
    self._data_reader = data_reader
    # ... initialize other components
  2. Training Loop: The training loop is implemented in the _do_run method:

def _do_run(self) -> None:
    while self._should_run_step():
        self._step_nr += 1

        # Run training step
        self._run_step()

        # Maybe validate
        if self._should_validate():
            self._validate()

        # Maybe checkpoint
        if self._should_checkpoint():
            self._checkpoint()
  3. Step Execution: The _run_step method is responsible for executing a single training step (a standalone gradient-accumulation sketch follows this list):

def _run_step(self) -> None:
    # Collect batches (more than one batch means gradient accumulation)
    batches = self._next_batches()

    # Process each batch
    for batch in batches:
        # Forward pass
        loss, num_targets = self._unit(batch)

        # Backward pass
        self._loss_scaler.backward(loss)

    # Update parameters once per step, after all gradients have accumulated
    self._loss_scaler.run_optimizer_step(self._step_nr)
  4. Validation: The validation loop is implemented in the _validate method (a sketch of scoring a single unit follows this list):

def _validate(self) -> None:
    self._model.eval()

    with summon_fsdp_for_validation(self._model):
        unit_scores = []

        for unit, data_reader in zip(self._valid_units, self._valid_data_readers):
            unit_score = self._validate_unit(unit, data_reader)
            if unit_score is not None:
                unit_scores.append(unit_score)

        self._valid_score = self._compute_valid_score(unit_scores)

    self._model.train()
  5. Checkpoint Management: The trainer supports flexible checkpoint management:
    • Save checkpoints at regular intervals (steps or epochs)

    • Keep N most recent checkpoints

    • Keep N best checkpoints based on validation score

    • Separate policies for full checkpoints vs model-only checkpoints

    • Support for FSDP model consolidation

  6. Metrics Logging: The trainer supports multiple logging backends:
    • TensorBoard: Visualize training curves

    • JSON Logs: Store metrics in files

    • Weights & Biases (WandB): Collaborative experiment tracking
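
Because _next_batches() in step 3 returns a list of batches, gradient accumulation falls out naturally: gradients from every batch in the list are accumulated before the single optimizer step at the end. The same pattern in plain PyTorch, as a self-contained illustrative sketch rather than fairseq2's actual implementation:

from torch.nn.utils import clip_grad_norm_

def accumulation_step(model, optimizer, batches, max_grad_norm=None):
    """Accumulate gradients over several micro-batches, then update once."""
    optimizer.zero_grad(set_to_none=True)

    for batch in batches:
        # Scale each loss so the accumulated gradient equals the average.
        loss = model(**batch).loss / len(batches)
        loss.backward()

    if max_grad_norm is not None:
        clip_grad_norm_(model.parameters(), max_grad_norm)

    optimizer.step()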
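
Similarly, for step 4, each validation unit is scored by iterating its data reader with gradients disabled. A hedged sketch of what scoring a single unit could look like; the _publish_validation_metrics helper and the reset() call are illustrative names, not fairseq2's actual internals:

def _validate_unit(self, unit: EvalUnit[BatchT], data_reader: DataReader[BatchT]) -> float | None:
    with torch.inference_mode():
        # Feed every validation batch through the unit; the unit updates its
        # own metric bag as a side effect.
        for batches in data_reader:
            for batch in batches:
                unit(batch)

    data_reader.reset()

    # Publish the unit's metrics and return its score, if it defines one.
    return self._publish_validation_metrics(unit)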

Best Practices

  1. Metric Tracking:
    • Register all relevant metrics in the train unit (see the sketch after this list)

    • Use appropriate metric types (Mean, Sum, etc.)

    • Consider adding validation metrics

  2. Resource Management:
    • Use appropriate batch sizes for your hardware

    • Enable amp for memory efficiency

    • Configure gradient accumulation as needed

  3. Checkpoint Management:
    • Save checkpoints regularly

    • Use both keep_last_n_checkpoints and keep_best_n_checkpoints

    • Consider separate policies for full checkpoints vs model-only checkpoints

  4. Validation:
    • Validate at appropriate intervals

    • Track relevant validation metrics

    • Implement early stopping if needed
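
For the metric-tracking practice above, registering metrics in a unit's MetricBag is a one-liner per metric. A brief sketch using the torcheval metric types and the MetricBag API shown earlier; the metric names are illustrative:

from torcheval.metrics import Mean, Sum

from fairseq2.metrics import MetricBag

metric_bag = MetricBag()

# Track the average loss per step and the total number of target tokens seen.
metric_bag.register_metric("loss", Mean())
metric_bag.register_metric("num_target_elements", Sum())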

Advanced Features

  1. Early Stopping:

    def early_stopper(step_nr: int, score: float) -> bool:
        # Custom early stopping logic: stop once the validation score falls
        # below a user-defined threshold.
        return score < threshold
    
    # Look up the descriptor of the metric used as the validation score
    # (metric_name is a user-chosen name, e.g. "loss").
    metric_descriptors = get_runtime_context().get_registry(MetricDescriptor)
    
    try:
        score_metric_descriptor = metric_descriptors.get(metric_name)
    except LookupError:
        raise UnknownMetricDescriptorError(metric_name) from None
    
    trainer = Trainer(
        early_stopper=early_stopper,
        score_metric_descriptor=score_metric_descriptor,
        lower_better=True,
        # ... plus the required arguments shown earlier (unit, data_reader, etc.)
    )
    
  2. Custom Learning Rate Scheduling:

    class CustomLRScheduler(LRScheduler):
        def get_lr(self) -> float:
            # Custom LR calculation; decay_factor is a user-defined function
            # of the current step number.
            return self.base_lr * decay_factor(self.step_nr)

    trainer = Trainer(
        lr_scheduler=CustomLRScheduler(optimizer),
        # ... plus the required arguments shown earlier
    )
    
  3. Profiling:

    num_skip_steps, num_record_steps = (100, 10)
    
    profile_dir = Path("logs/tb")
    
    profiler = TorchProfiler(
        num_skip_steps, num_record_steps, profile_dir, gangs.root
    )
    
    trainer = Trainer(
        profiler=profiler,
        ...
    )