S3 Checkpointing

fairseq2 supports storing checkpoints on Amazon S3 while keeping other training artifacts (logs, metrics, caches) on local or NFS storage. This hybrid approach is useful when you want to leverage S3’s scalability and durability for large checkpoint files.

Prerequisites

  1. Install the s3fs package if it is not already present in your environment:

    pip install s3fs
    
  2. Configure AWS credentials using one of the standard methods:

    • Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

    • AWS credentials file (~/.aws/credentials)

    • IAM role (when running on AWS infrastructure)
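Before launching a long training run, it can help to confirm that at least one credential source is visible. The snippet below is a quick standard-library sketch (not part of fairseq2; `credential_sources` is a hypothetical helper) that checks the first two sources; IAM roles are resolved by the AWS SDK at runtime and cannot be detected this way:

```python
import os
from pathlib import Path

def credential_sources() -> list[str]:
    """Return which standard AWS credential sources appear to be configured."""
    sources = []
    # 1. Environment variables
    if os.environ.get("AWS_ACCESS_KEY_ID") and os.environ.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment")
    # 2. Shared credentials file
    if Path("~/.aws/credentials").expanduser().is_file():
        sources.append("credentials-file")
    # IAM roles are provided by AWS infrastructure at runtime and are
    # not detectable from the local environment alone.
    return sources

print(credential_sources())
```

An empty list does not guarantee failure (an IAM role may still apply), but it is a useful early warning on non-AWS hosts.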

Usage

Use the --checkpoint-dir CLI option to redirect checkpoints to an S3 bucket:

python -m recipes.lm.train /local/output/dir \
    --checkpoint-dir s3://my-bucket/checkpoints/experiment1

This will:

  • Store checkpoints (and model.yaml) at s3://my-bucket/checkpoints/experiment1/config_hash/step_N/

  • Keep logs, metrics, and other artifacts in /local/output/dir/config_hash/

The config_hash (e.g., ws_1.d2b3ae4f) is automatically appended to both directories based on the training configuration, ensuring consistent organization across local and remote storage.
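To make the resulting layout concrete, here is an illustrative sketch of how the hash extends both bases (`with_config_hash` is a hypothetical helper, not fairseq2's internal code):

```python
def with_config_hash(base: str, config_hash: str) -> str:
    """Append the config hash as the final path component (illustrative only)."""
    return base.rstrip("/") + "/" + config_hash

config_hash = "ws_1.d2b3ae4f"  # derived from the training configuration

local = with_config_hash("/local/output/dir", config_hash)
remote = with_config_hash("s3://my-bucket/checkpoints/experiment1", config_hash)

print(local)   # /local/output/dir/ws_1.d2b3ae4f
print(remote)  # s3://my-bucket/checkpoints/experiment1/ws_1.d2b3ae4f
```

Because the same hash is appended on both sides, a given run's logs and its checkpoints can always be matched up by their final path component.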

Resuming from S3 Checkpoints

To resume training from S3 checkpoints, use the same --checkpoint-dir option:

python -m recipes.lm.train /local/output/dir \
    --checkpoint-dir s3://my-bucket/key/ \
    --resume-from last

The checkpoint manager will automatically detect and load the latest checkpoint from S3.

Using S3 Paths in Asset Cards

You can define model or dataset cards that reference S3 paths directly. Here is an example of a model card with checkpoints and tokenizer stored on S3:

# my_s3_model.yaml
name: my_s3_model
model_family: llama
model_arch: llama3_8b
checkpoint: "s3://my-bucket/models/my_model/checkpoint.pt"
tokenizer: "s3://my-bucket/models/my_model/tokenizer.model"
tokenizer_family: llama

And an example dataset card with data files on S3:

# my_s3_dataset.yaml
name: my_s3_dataset
dataset_family: generic_text
path: "s3://my-bucket/datasets/my_dataset/"

To use these cards, add them to an asset store directory and specify it via:

python -m recipes.lm.train /output/dir \
    --config common.asset.extra_paths="['/path/to/my/cards']"

Registering a Custom S3 Filesystem with AWS Profile

If you need to use a specific AWS profile or custom S3 configuration (e.g., for accessing buckets with different credentials), you can register a custom filesystem before running your training:

import s3fs
from fairseq2.file_system import FileSystemRegistry, FSspecFileSystem

def register_s3_with_profile(
    profile_name: str,
    bucket_pattern: str | None = None,
) -> None:
    """
    Register an S3 filesystem with a specific AWS profile.

    Args:
        profile_name: The AWS profile name from ~/.aws/credentials
        bucket_pattern: Optional bucket name pattern to match. If provided,
            only S3 paths containing this pattern will use this filesystem.
            If None, this filesystem will be used for all S3 paths.
    """
    # Create S3 filesystem with the specified profile
    s3_fs = s3fs.S3FileSystem(profile=profile_name)

    # Define the pattern check function
    if bucket_pattern:
        def pattern_check(path) -> bool:
            path_str = str(path)
            return path_str.startswith("s3:/") and bucket_pattern in path_str
    else:
        def pattern_check(path) -> bool:
            return str(path).startswith("s3:/")

    # Wrap and register the filesystem
    wrapped_fs = FSspecFileSystem(s3_fs, "s3:/")
    FileSystemRegistry.register(pattern_check, lambda: wrapped_fs)

# Example: Register S3 filesystem with "my-team-profile" for a specific bucket
register_s3_with_profile(
    profile_name="my-team-profile",
    bucket_pattern="my-team-bucket",
)

# Now S3 paths like "s3://my-team-bucket/..." will use this profile

Programmatic Usage

When using the training API directly, pass checkpoint_dir to the run() function:

from pathlib import Path
from fairseq2.recipe import run

run(
    recipe,
    config,
    output_dir=Path("/local/output/dir"),
    checkpoint_dir=Path("s3://my-bucket/key/"),
)

Implementation Notes

  • Atomic writes: For local filesystems, checkpoints are written to a temporary directory (step_N.tmp) and atomically renamed upon completion. For S3, writes go directly to the final location since S3 doesn’t support atomic renames.

  • Tested protocols: Currently, only the file and local (both local filesystem) and s3 protocols are officially supported. Other fsspec-compatible protocols may work but are not tested.

  • Filesystem priority: When multiple filesystems are registered for overlapping patterns, the most recently registered one takes precedence. Register bucket-specific filesystems after the default S3 filesystem.
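The atomic-write pattern from the first note can be sketched with the standard library (a simplified illustration, not fairseq2's implementation; `save_atomically` is a hypothetical helper, and `os.replace` is atomic only on POSIX filesystems):

```python
import json
import os
import tempfile
from pathlib import Path

def save_atomically(state: dict, step_dir: Path) -> None:
    """Write into step_N.tmp, then atomically rename it to step_N."""
    tmp_dir = step_dir.with_name(step_dir.name + ".tmp")
    tmp_dir.mkdir(parents=True, exist_ok=True)
    (tmp_dir / "state.json").write_text(json.dumps(state))
    # A crash before this point leaves only the .tmp directory behind,
    # so a completed step_N directory is never partially written.
    os.replace(tmp_dir, step_dir)

out = Path(tempfile.mkdtemp())
save_atomically({"step": 1000}, out / "step_1000")
print((out / "step_1000" / "state.json").exists())  # True
```

S3 has no rename primitive (a "rename" is a copy plus delete), which is why checkpoints are written directly to their final S3 location instead.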

See Also