fairseq2.model_checkpoint

This module provides a memory-efficient model checkpoint loading API. It lazily loads various checkpoint formats and supports distributed configurations with on-the-fly tensor resharding.

The loaders support:

  • Memory-efficient lazy loading that avoids reading an entire checkpoint into memory at once, provided the underlying format allows it. This is particularly relevant for large checkpoints that may not fit in memory.

  • On-the-fly tensor resharding across different distributed configurations.

  • Optional memory mapping for reduced memory footprint.

  • State dict conversion for format compatibility.

  • Automatic format detection.

Example Usage
from fairseq2.model_checkpoint import DelegatingModelCheckpointLoader

gangs = ...  # Set up gangs

# Delegates model loading to the appropriate loader based on the checkpoint
# format.
loader = DelegatingModelCheckpointLoader()

checkpoint_path = ...  # Path to checkpoint file

# Load checkpoint parameters lazily
for key, tensor in loader.lazy_load(checkpoint_path, gangs):
    ...  # Process the tensor without materializing the entire checkpoint

Interfaces

class fairseq2.model_checkpoint.ModelCheckpointLoader[source]

Bases: ABC

Represents the abstract base class for model checkpoint loaders.

This class defines the interface for checkpoint loaders that can efficiently load model state by yielding parameters lazily rather than loading everything into memory at once.

abstract lazy_load(path: Path, gangs: Gangs, *, mmap: bool = False, restrict: bool = True, state_dict_converter: StateDictConverter | None = None, shard_specs: Mapping[str, ShardSpec] | None = None, shard_dims: Mapping[str, int] | None = None) → Iterator[tuple[str, Tensor]][source]

Lazily loads parameters from the specified checkpoint path.

Yields tensors one at a time to minimize memory usage if the underlying format allows it. Supports tensor resharding and optional state dictionary conversion.

gangs is used to determine the distributed target configuration and shard yielded parameters accordingly. If None, no sharding will be performed and full parameters will be yielded.

If mmap is True, the checkpoint will be memory-mapped. This can reduce memory usage but may cause slower load times on some systems.

If restrict is True, pickle (if used) will be restricted to load only tensors and types that can be safely serialized and deserialized.

If state_dict_converter is provided, it will be used to transform the (sharded) state dictionaries in the checkpoint. This is typically used to convert from another format, such as Hugging Face Transformers, to fairseq2.

If shard_dims is provided, it specifies the sharding dimension of each parameter as returned by get_sharding_dims(). Together with gangs, it enables on-the-fly parameter resharding during checkpoint loading. If None, no resharding will be performed and full parameters will be loaded.

shard_specs is deprecated and will be removed in a future release; please use shard_dims instead.

Yields pairs of (parameter name, parameter) for each parameter in the checkpoint.

Raises:

ModelCheckpointError – If the checkpoint is not valid.
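
As an illustration, a caller might consume lazy_load() to copy parameters into an existing model one at a time. This is a minimal sketch that reuses loader, checkpoint_path, and gangs from the usage example above; the converter function and the parameter name passed in shard_dims are hypothetical:

import torch

def convert_hf_to_fs2(state_dict):
    # Hypothetical converter: rename keys from a Hugging Face layout to
    # the layout expected by the fairseq2 model.
    return {key.removeprefix("model."): value for key, value in state_dict.items()}

model = ...  # Already-constructed model whose parameters match the checkpoint

params = dict(model.named_parameters())

with torch.no_grad():
    for key, tensor in loader.lazy_load(
        checkpoint_path,
        gangs,
        mmap=True,  # memory-map instead of reading the file eagerly
        state_dict_converter=convert_hf_to_fs2,
        shard_dims={"embed.weight": 0},  # hypothetical parameter name
    ):
        # At most one full parameter needs to be materialized at a time.
        params[key].copy_(tensor)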

abstract supports_path(path: Path) → bool[source]

Checks if this loader can handle the specified checkpoint path.
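
To support a new format, implement both methods. Below is a skeletal loader for a hypothetical one-tensor-per-file directory layout; it ignores resharding and state dict conversion for brevity, which a real implementation would also handle (e.g. via reshard_tensor(), documented below):

from collections.abc import Iterator
from pathlib import Path

import torch
from torch import Tensor

from fairseq2.model_checkpoint import ModelCheckpointLoader


class TensorPerFileCheckpointLoader(ModelCheckpointLoader):
    """Hypothetical loader: one tensor per '<name>.tensor' file in a directory."""

    def lazy_load(
        self, path: Path, gangs, *, mmap=False, restrict=True,
        state_dict_converter=None, shard_specs=None, shard_dims=None,
    ) -> Iterator[tuple[str, Tensor]]:
        for file in sorted(path.glob("*.tensor")):
            # `weights_only=True` restricts pickle to tensors and safe types,
            # mirroring the semantics of `restrict`.
            tensor = torch.load(file, mmap=mmap, weights_only=restrict)
            yield file.stem, tensor

    def supports_path(self, path: Path) -> bool:
        return path.is_dir() and any(path.glob("*.tensor"))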

Classes

final class fairseq2.model_checkpoint.NativeModelCheckpointLoader(file_system: FileSystem, tensor_loader: TensorLoader)[source]

Bases: ModelCheckpointLoader

Loads native fairseq2 checkpoints.

The native fairseq2 format is optimized for efficient storage and loading of model checkpoints in distributed configurations.

lazy_load(path: Path, gangs: Gangs, *, mmap: bool = False, restrict: bool = True, state_dict_converter: StateDictConverter | None = None, shard_specs: Mapping[str, ShardSpec] | None = None, shard_dims: Mapping[str, int] | None = None) → Iterator[tuple[str, Tensor]][source]

Lazily loads parameters from the specified checkpoint path.

Yields tensors one at a time to minimize memory usage if the underlying format allows it. Supports tensor resharding and optional state dictionary conversion.

gangs is used to determine the distributed target configuration and shard yielded parameters accordingly. If None, no sharding will be performed and full parameters will be yielded.

If mmap is True, the checkpoint will be memory-mapped. This can reduce memory usage but may cause slower load times on some systems.

If restrict is True, pickle (if used) will be restricted to load only tensors and types that can be safely serialized and deserialized.

If state_dict_converter is provided, it will be used to transform the (sharded) state dictionaries in the checkpoint. This is typically used to convert from another format, such as Hugging Face Transformers, to fairseq2.

If shard_dims is provided, it specifies the sharding dimension of each parameter as returned by get_sharding_dims(). Together with gangs, it enables on-the-fly parameter resharding during checkpoint loading. If None, no resharding will be performed and full parameters will be loaded.

shard_specs is deprecated and will be removed in a future release; please use shard_dims instead.

Yields pairs of (parameter name, parameter) for each parameter in the checkpoint.

Raises:

ModelCheckpointError – If the checkpoint is not valid.

supports_path(path: Path) → bool[source]

Checks if this loader can handle the specified checkpoint path.

final class fairseq2.model_checkpoint.BasicModelCheckpointLoader(file_system: FileSystem, tensor_loader: TensorLoader)[source]

Bases: ModelCheckpointLoader

Loads single-file PyTorch checkpoints (.pt, .pth, .bin).

lazy_load(path: Path, gangs: Gangs, *, mmap: bool = False, restrict: bool = True, state_dict_converter: StateDictConverter | None = None, shard_specs: Mapping[str, ShardSpec] | None = None, shard_dims: Mapping[str, int] | None = None) → Iterator[tuple[str, Tensor]][source]

Lazily loads parameters from the specified checkpoint path.

Yields tensors one at a time to minimize memory usage if the underlying format allows it. Supports tensor resharding and optional state dictionary conversion.

gangs is used to determine the distributed target configuration and shard yielded parameters accordingly. If None, no sharding will be performed and full parameters will be yielded.

If mmap is True, the checkpoint will be memory-mapped. This can reduce memory usage but may cause slower load times on some systems.

If restrict is True, pickle (if used) will be restricted to load only tensors and types that can be safely serialized and deserialized.

If state_dict_converter is provided, it will be used to transform the (sharded) state dictionaries in the checkpoint. This is typically used to convert from another format, such as Hugging Face Transformers, to fairseq2.

If shard_dims is provided, it specifies the sharding dimension of each parameter as returned by get_sharding_dims(). Together with gangs, it enables on-the-fly parameter resharding during checkpoint loading. If None, no resharding will be performed and full parameters will be loaded.

shard_specs is deprecated and will be removed in a future release; please use shard_dims instead.

Yields pairs of (parameter name, parameter) for each parameter in the checkpoint.

Raises:

ModelCheckpointError – If the checkpoint is not valid.

supports_path(path: Path) → bool[source]

Checks if this loader can handle the specified checkpoint path.

final class fairseq2.model_checkpoint.SafetensorsCheckpointLoader(file_system: FileSystem, safetensors_loader: SafetensorsLoader)[source]

Bases: ModelCheckpointLoader

Loads Safetensors checkpoints.

This loader supports both single-file and multi-file Safetensors checkpoints. Multi-file checkpoints typically follow the “model-x-of-N.safetensors” naming pattern used on the Hugging Face Hub.
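
For reference, a sharded Safetensors checkpoint as distributed on the Hugging Face Hub typically consists of files such as:

model-00001-of-00003.safetensors
model-00002-of-00003.safetensors
model-00003-of-00003.safetensors
model.safetensors.index.json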

lazy_load(path: Path, gangs: Gangs, *, mmap: bool = False, restrict: bool = True, state_dict_converter: StateDictConverter | None = None, shard_specs: Mapping[str, ShardSpec] | None = None, shard_dims: Mapping[str, int] | None = None) → Iterator[tuple[str, Tensor]][source]

Lazily loads parameters from the specified checkpoint path.

Yields tensors one at a time to minimize memory usage if the underlying format allows it. Supports tensor resharding and optional state dictionary conversion.

gangs is used to determine the distributed target configuration and shard yielded parameters accordingly. If None, no sharding will be performed and full parameters will be yielded.

If mmap is True, the checkpoint will be memory-mapped. This can reduce memory usage but may cause slower load times on some systems.

If restrict is True, pickle (if used) will be restricted to load only tensors and types that can be safely serialized and deserialized.

If state_dict_converter is provided, it will be used to transform the (sharded) state dictionaries in the checkpoint. This is typically used to convert from another format, such as Hugging Face Transformers, to fairseq2.

If shard_dims is provided, it specifies the sharding dimension of each parameter as returned by get_sharding_dims(). Together with gangs, it enables on-the-fly parameter resharding during checkpoint loading. If None, no resharding will be performed and full parameters will be loaded.

shard_specs is deprecated and will be removed in a future release; please use shard_dims instead.

Yields pairs of (parameter name, parameter) for each parameter in the checkpoint.

Raises:

ModelCheckpointError – If the checkpoint is not valid.

supports_path(path: Path) → bool[source]

Checks if this loader can handle the specified checkpoint path.

final class fairseq2.model_checkpoint.DelegatingModelCheckpointLoader(loaders: Sequence[ModelCheckpointLoader], file_system: FileSystem)[source]

Bases: ModelCheckpointLoader

Delegates loading to format-specific checkpoint loaders.

This loader maintains a collection of specialized loaders and automatically selects the appropriate one based on the checkpoint file format. It provides a unified interface for loading various checkpoint formats without requiring the caller to handle format-specific logic.

The loader iterates through its registered loaders in order and uses the first one that reports it can handle the given path via ModelCheckpointLoader.supports_path().
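
Conceptually, the selection amounts to a first-match scan over the registered loaders. A minimal sketch of the idea (not the actual implementation):

from pathlib import Path

from fairseq2.model_checkpoint import ModelCheckpointError, ModelCheckpointLoader

def select_loader(loaders: list[ModelCheckpointLoader], path: Path) -> ModelCheckpointLoader:
    # Return the first registered loader that recognizes the path.
    for loader in loaders:
        if loader.supports_path(path):
            return loader
    raise ModelCheckpointError(path, f"'{path}' does not point to a known checkpoint format.")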

lazy_load(path: Path, gangs: Gangs, *, mmap: bool = False, restrict: bool = True, state_dict_converter: StateDictConverter | None = None, shard_specs: Mapping[str, ShardSpec] | None = None, shard_dims: Mapping[str, int] | None = None) → Iterator[tuple[str, Tensor]][source]

Lazily loads parameters from the specified checkpoint path.

Yields tensors one at a time to minimize memory usage if the underlying format allows it. Supports tensor resharding and optional state dictionary conversion.

gangs is used to determine the distributed target configuration and shard yielded parameters accordingly. If None, no sharding will be performed and full parameters will be yielded.

If mmap is True, the checkpoint will be memory-mapped. This can reduce memory usage but may cause slower load times on some systems.

If restrict is True, pickle (if used) will be restricted to load only tensors and types that can be safely serialized and deserialized.

If state_dict_converter is provided, it will be used to transform the (sharded) state dictionaries in the checkpoint. This is typically used to convert from another format, such as Hugging Face Transformers, to fairseq2.

If shard_dims is provided, it specifies the sharding dimension of each parameter as returned by get_sharding_dims(). Together with gangs, it enables on-the-fly parameter resharding during checkpoint loading. If None, no resharding will be performed and full parameters will be loaded.

shard_specs is deprecated and will be removed in a future release; please use shard_dims instead.

Yields pairs of (parameter name, parameter) for each parameter in the checkpoint.

Raises:

ModelCheckpointError – If the checkpoint is not valid.

supports_path(path: Path) → bool[source]

Checks if this loader can handle the specified checkpoint path.

Functions

fairseq2.model_checkpoint.reshard_tensor(key: str, source_splits: list[list[Tensor]], source_shard_sizes: tuple[int, int], target_shard_sizes: tuple[int, int], target_shard_ranks: tuple[int, int], shard_specs: Mapping[str, ShardSpec] | None, shard_dims: Mapping[str, int] | None = None) → Tensor[source]

Reshards a parameter tensor from a distributed source configuration to a target configuration.

This function is meant for authors of new ModelCheckpointLoader implementations. It handles the complex task of resharding tensors when loading a checkpoint saved under one distributed configuration (e.g. 4-way tensor parallelism) into a different target configuration (e.g. 8-way tensor parallelism), efficiently concatenating and slicing tensors to produce the correct shards for the target rank. Existing implementations such as NativeModelCheckpointLoader can be inspected to see how reshard_tensor() is used in practice.

The resharding process involves:

  1. Determining if the tensor requires tensor parallelism based on specified shard dimensions.

  2. For tensor parallel tensors, concatenating source shards and re-slicing them for the target configuration in a memory-efficient way (see the sketch after this list).

  3. For replicated tensors, concatenating data parallel splits.
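
To make step 2 concrete, the following shows the gist of concatenate-and-reslice for a single TP-sharded tensor. It is a sketch of the semantics only, not the memory-optimized implementation:

import torch

# An (8, 4) weight stored as two 2-way TP shards along dim 0.
source_tp_shards = [torch.randn(4, 4), torch.randn(4, 4)]
shard_dim, target_tp_size, target_tp_rank = 0, 4, 2

# Reassemble the full tensor, then slice out this rank's contiguous block.
full = torch.cat(source_tp_shards, dim=shard_dim)
target_size = full.size(shard_dim) // target_tp_size  # 8 // 4 = 2
resharded = full.narrow(shard_dim, target_tp_rank * target_size, target_size)
# `resharded` holds rows 4:6 of the full tensor.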

key specifies the name of the parameter whose sharding information is looked up in shard_dims. See get_sharding_dims() for more information.

source_splits is a 2D list structure [tp_idx][dp_idx] containing the source tensor shards. The outer list specifies tensor parallel shards and inner lists specify data parallel shards.

source_shard_sizes and target_shard_sizes specify the distributed source and target configurations respectively in the form of (tp_size, dp_size).

target_shard_ranks specifies the ranks of the current process in the target configuration in the form of (tp_rank, dp_rank).

If shard_dims is provided, it specifies the mapping from parameter names to the dimensions along which they are sharded for tensor parallelism; replicated tensors are omitted from the mapping. See get_sharding_dims() for more information.

shard_specs is deprecated and will be removed in a future release; please use shard_dims instead.

Returns the resharded tensor for the target rank and configuration.

Resharding from 2-way TP to 4-way TP
import torch

from fairseq2.model_checkpoint import reshard_tensor

# Two 2-way TP shards of an (8, 4) weight, one DP shard each.
tensor_tp0_dp0 = torch.randn(4, 4)
tensor_tp1_dp0 = torch.randn(4, 4)

source_splits = [[tensor_tp0_dp0], [tensor_tp1_dp0]]  # [tp_idx][dp_idx]
source_shard_sizes = (2, 1)  # 2-way TP, 1-way DP
target_shard_sizes = (4, 1)  # 4-way TP, 1-way DP
target_shard_ranks = (2, 0)  # Want the shard for TP rank 2

# For a tensor sharded along dim 0, this concatenates the 2 source shards
# and slices out the portion corresponding to TP rank 2 in the 4-way setup.
resharded = reshard_tensor(
    "model.weight",
    source_splits,
    source_shard_sizes,
    target_shard_sizes,
    target_shard_ranks,
    None,  # shard_specs (deprecated)
    {"model.weight": 0},  # shard_dims
)

Note

This function deletes intermediate tensors during the resharding process to minimize peak memory usage.

Exceptions

class fairseq2.model_checkpoint.ModelCheckpointError(path: Path, message: str)[source]

Bases: Exception

Raised when a model checkpoint is not valid.
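
For example, a caller can catch it to report unusable checkpoints; loader, checkpoint_path, and gangs are assumed to be set up as in the usage example at the top of this page:

from fairseq2.model_checkpoint import ModelCheckpointError

try:
    for key, tensor in loader.lazy_load(checkpoint_path, gangs):
        ...  # consume parameters
except ModelCheckpointError as ex:
    print(f"Checkpoint is not valid: {ex}")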