.. _guides-s3-checkpointing:

===========================
S3 Checkpointing
===========================

fairseq2 supports storing checkpoints on Amazon S3 while keeping other training
artifacts (logs, metrics, caches) on local or NFS storage. This hybrid approach
is useful when you want to leverage S3's scalability and durability for large
checkpoint files.

Prerequisites
=============

1. Make sure the ``s3fs`` package is available. It is included with fairseq2 by
   default, so install it manually only if it is missing:

   .. code-block:: bash

       pip install s3fs

2. Configure AWS credentials using one of the standard methods:

   - Environment variables (``AWS_ACCESS_KEY_ID``, ``AWS_SECRET_ACCESS_KEY``)
   - AWS credentials file (``~/.aws/credentials``)
   - IAM role (when running on AWS infrastructure)
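
Before launching a long run, it can help to confirm which of these credential
sources is actually present on the machine. The following is a minimal sketch
using only the standard library; ``find_credential_sources`` is a hypothetical
helper, not part of fairseq2 or ``s3fs``, and it cannot detect an IAM role,
which is resolved by AWS at request time.

.. code-block:: python

    import os
    from pathlib import Path


    def find_credential_sources(env=None, home=None):
        """Return the locally visible AWS credential sources (hypothetical helper)."""
        env = os.environ if env is None else env
        home = Path.home() if home is None else Path(home)

        sources = []

        # Environment-based credentials need both variables to be set.
        if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
            sources.append("environment")

        # The shared credentials file at ~/.aws/credentials.
        if (home / ".aws" / "credentials").is_file():
            sources.append("credentials_file")

        return sources

Calling ``find_credential_sources()`` with no arguments inspects the current
environment and home directory. An empty list means neither of the first two
methods is configured.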

Usage
=====

Use the ``--checkpoint-dir`` CLI option to redirect checkpoints to an S3 bucket:

.. code-block:: bash

    python -m recipes.lm.train /local/output/dir \
        --checkpoint-dir s3://my-bucket/checkpoints/experiment1

This will:

- Store checkpoints (and ``model.yaml``) at ``s3://my-bucket/checkpoints/experiment1/config_hash/step_N/``
- Keep logs, metrics, and other artifacts in ``/local/output/dir/config_hash/``

The ``config_hash`` component (e.g., ``ws_1.d2b3ae4f``) is derived from the training
configuration and automatically appended to both directories, so the local and
remote artifacts of a run share the same subdirectory name.
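
The resulting layout can be sketched with plain string handling. The helper
below is hypothetical and only illustrates the directory structure described
above; fairseq2 computes the real ``config_hash`` itself, so here it is passed
in as an argument.

.. code-block:: python

    def expected_layout(output_dir, checkpoint_dir, config_hash, step):
        """Return (local_run_dir, remote_step_dir) for the layout described above."""
        # Strip trailing slashes so the joined paths contain no empty segments.
        local_run_dir = f"{output_dir.rstrip('/')}/{config_hash}/"
        remote_step_dir = f"{checkpoint_dir.rstrip('/')}/{config_hash}/step_{step}/"
        return local_run_dir, remote_step_dir


    local_dir, remote_dir = expected_layout(
        "/local/output/dir",
        "s3://my-bucket/checkpoints/experiment1",
        "ws_1.d2b3ae4f",  # example hash taken from the text above
        1000,
    )
    print(local_dir)   # /local/output/dir/ws_1.d2b3ae4f/
    print(remote_dir)  # s3://my-bucket/checkpoints/experiment1/ws_1.d2b3ae4f/step_1000/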

Resuming from S3 Checkpoints
============================

To resume training from checkpoints stored on S3, pass the same ``--checkpoint-dir``
location that the original run wrote to:

.. code-block:: bash

    python -m recipes.lm.train /local/output/dir \
        --checkpoint-dir s3://my-bucket/key/ \
        --resume-from last

The checkpoint manager will automatically detect and load the latest checkpoint
from S3.
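
As an illustration of what "latest" means for ``step_N`` directories, the
sketch below picks the highest step number from a directory listing. It is not
the checkpoint manager's actual implementation; ``latest_step`` is a
hypothetical helper, and treating names with a ``.tmp`` suffix as incomplete is
an assumption made for this example.

.. code-block:: python

    import re

    # Matches completed step directories such as "step_1000", but not "step_1000.tmp".
    _STEP_DIR = re.compile(r"^step_(\d+)$")


    def latest_step(dir_names):
        """Return the largest step number among the given names, or None."""
        steps = []
        for name in dir_names:
            match = _STEP_DIR.match(name.rstrip("/"))
            if match:
                steps.append(int(match.group(1)))
        return max(steps) if steps else None


    listing = ["step_100/", "step_2000/", "step_300.tmp/", "model.yaml"]
    print(latest_step(listing))  # 2000

Comparing the parsed integers rather than the raw names matters here, because
a lexicographic sort would rank ``step_300`` above ``step_2000``.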

Using S3 Paths in Asset Cards
=============================

You can define model or dataset cards that reference S3 paths directly. Here is
an example of a model card with checkpoints and tokenizer stored on S3:

.. code-block:: yaml

    # my_s3_model.yaml
    name: my_s3_model
    model_family: llama
    model_arch: llama3_8b
    checkpoint: "s3://my-bucket/models/my_model/checkpoint.pt"
    tokenizer: "s3://my-bucket/models/my_model/tokenizer.model"
    tokenizer_family: llama

And an example dataset card with data files on S3:

.. code-block:: yaml

    # my_s3_dataset.yaml
    name: my_s3_dataset
    dataset_family: generic_text
    path: "s3://my-bucket/datasets/my_dataset/"

To use these cards, place them in an asset store directory and add that directory
to ``common.asset.extra_paths``:

.. code-block:: bash

    python -m recipes.lm.train /output/dir \
        --config common.asset.extra_paths="['/path/to/my/cards']"
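
Since a typo in an S3 URI only surfaces once the asset is loaded, it can be
worth checking the URIs in a card beforehand. The following sketch uses the
standard library's ``urllib.parse`` to split a URI into bucket and key;
``split_s3_uri`` is a hypothetical helper and is not part of fairseq2.

.. code-block:: python

    from urllib.parse import urlsplit


    def split_s3_uri(uri):
        """Split "s3://bucket/key" into (bucket, key); raise ValueError otherwise."""
        parts = urlsplit(uri)
        if parts.scheme != "s3" or not parts.netloc:
            raise ValueError(f"not a valid S3 URI: {uri!r}")
        # The bucket is the network location; the key is the path without its leading slash.
        return parts.netloc, parts.path.lstrip("/")


    bucket, key = split_s3_uri("s3://my-bucket/models/my_model/checkpoint.pt")
    print(bucket)  # my-bucket
    print(key)     # models/my_model/checkpoint.pt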

Registering a Custom S3 Filesystem with AWS Profile
===================================================

If you need a specific AWS profile or a custom S3 configuration (e.g., to access
buckets that require different credentials), you can register a custom filesystem
before training starts:

.. code-block:: python

    import s3fs
    from fairseq2.file_system import FileSystemRegistry, FSspecFileSystem

    def register_s3_with_profile(
        profile_name: str,
        bucket_pattern: str | None = None,
    ) -> None:
        """
        Register an S3 filesystem with a specific AWS profile.

        Args:
            profile_name: The AWS profile name from ~/.aws/credentials
            bucket_pattern: Optional bucket name pattern to match. If provided,
                only S3 paths containing this pattern will use this filesystem.
                If None, this filesystem will be used for all S3 paths.
        """
        # Create S3 filesystem with the specified profile
        s3_fs = s3fs.S3FileSystem(profile=profile_name)

        # Define the pattern check function
        if bucket_pattern:
            def pattern_check(path) -> bool:
                path_str = str(path)
                return path_str.startswith("s3:/") and bucket_pattern in path_str
        else:
            def pattern_check(path) -> bool:
                return str(path).startswith("s3:/")

        # Wrap and register the filesystem
        wrapped_fs = FSspecFileSystem(s3_fs, "s3:/")
        FileSystemRegistry.register(pattern_check, lambda: wrapped_fs)

    # Example: Register S3 filesystem with "my-team-profile" for a specific bucket
    register_s3_with_profile(
        profile_name="my-team-profile",
        bucket_pattern="my-team-bucket",
    )

    # Now S3 paths like "s3://my-team-bucket/..." will use this profile
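
The pattern check above tests for the prefix ``s3:/`` with a single slash. One
reason this prefix is a safe choice is that ``pathlib`` collapses the double
slash of an ``s3://`` URI when it is wrapped in a path object. The sketch below
reproduces the predicate on its own, without ``s3fs`` or fairseq2, so its
behavior can be checked in isolation; ``make_pattern_check`` is a hypothetical
name used only for this example.

.. code-block:: python

    from pathlib import PurePosixPath


    def make_pattern_check(bucket_pattern=None):
        """Build a predicate equivalent to the pattern_check functions above."""
        def pattern_check(path):
            path_str = str(path)
            if not path_str.startswith("s3:/"):
                return False
            # Without a pattern, every S3 path matches.
            return bucket_pattern is None or bucket_pattern in path_str
        return pattern_check


    check = make_pattern_check("my-team-bucket")

    # pathlib collapses "s3://" to "s3:/" when the URI is wrapped in a path object.
    print(str(PurePosixPath("s3://my-team-bucket/run1")))  # s3:/my-team-bucket/run1

    print(check(PurePosixPath("s3://my-team-bucket/run1")))  # True
    print(check("s3://other-bucket/run1"))                   # False
    print(check("/local/my-team-bucket/run1"))               # False

Note that the bucket test is a substring match, so a pattern such as
``my-team-bucket`` also matches ``my-team-bucket-archive``. Choose a pattern
that is specific enough for the buckets you use.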


Programmatic Usage
==================

When using the training API directly, pass ``checkpoint_dir`` to the ``run()`` function:

.. code-block:: python

    from pathlib import Path
    from fairseq2.recipe import run

    run(
        recipe,
        config,
        output_dir=Path("/local/output/dir"),
        checkpoint_dir=Path("s3://my-bucket/key/"),
    )

Implementation Notes
====================

- **Atomic writes**: For local filesystems, checkpoints are written to a temporary
  directory (``step_N.tmp``) and atomically renamed upon completion. For S3,
  writes go directly to the final location since S3 doesn't support atomic renames.

- **Tested protocols**: Currently, only the ``file`` (local), ``local``, and ``s3``
  protocols are tested and officially supported. Other fsspec-compatible
  protocols may work but are untested.

- **Filesystem priority**: When multiple filesystems are registered for
  overlapping patterns, the most recently registered one takes precedence.
  Register bucket-specific filesystems after the default S3 filesystem.
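
The write-then-rename pattern from the atomic writes note can be sketched for
a local filesystem as follows. This is an illustration of the general
technique and not fairseq2's checkpoint code; ``save_step_atomically`` is a
hypothetical helper, and the rename is only atomic when both directories live
on the same POSIX filesystem.

.. code-block:: python

    import os
    from pathlib import Path


    def save_step_atomically(root, step, files):
        """Write files into step_N.tmp, then rename the directory to step_N."""
        root = Path(root)
        tmp_dir = root / f"step_{step}.tmp"
        final_dir = root / f"step_{step}"

        # Write everything into the temporary directory first.
        tmp_dir.mkdir(parents=True)
        for name, data in files.items():
            (tmp_dir / name).write_bytes(data)

        # Readers see step_N only after all files have been written.
        os.rename(tmp_dir, final_dir)
        return final_dir

A process that crashes midway leaves behind only a ``step_N.tmp`` directory,
which is easy to recognize as incomplete. On S3 there is no equivalent rename,
which is why writes there go directly to the final location.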

See Also
========

* :doc:`/basics/building_recipes` - Building custom training recipes
* :doc:`/basics/assets` - Understanding the fairseq2 asset system
* :doc:`/reference/checkpoint` - Checkpoint API reference