fairseq2 saves checkpoints and TensorBoard event files to the defined $OUTPUT_DIR, so you can inspect the details of your jobs.
```bash
# run tensorboard at your checkpoint path
tensorboard --logdir $CHECKPOINT_PATH

# example
tensorboard --logdir /checkpoint/$USER/outputs/ps_llama3_1_instruct.ws_16.a73dad52/tb/train
```
If you ran your experiment on a remote server, you probably need to port-forward the TensorBoard service to your local machine:
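A minimal sketch using SSH tunneling, assuming TensorBoard runs on its default port 6006 and that $USER and $SERVER_HOSTNAME stand in for your own login and server address:

```bash
# forward the remote TensorBoard port to your local machine,
# then open http://localhost:6006 in your local browser
ssh -L 6006:localhost:6006 $USER@$SERVER_HOSTNAME
```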
fairseq2 natively supports WandB (Weights & Biases) - a powerful tool for monitoring and managing machine learning experiments.
WandB provides a centralized platform to track, compare, and analyze the performance of different models, making it easier to identify trends, optimize hyperparameters, and reproduce results.
Follow the quick start guide to initialize it in your environment.
All you need to do is add the following line to your config YAML file:

```yaml
wandb_project: <YOUR_PROJECT_NAME>
```
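For context, a sketch of where this line sits in a config file; the file name and project name below are placeholders, and the rest of the file keeps whatever recipe options you already use:

```yaml
# my_config.yaml (hypothetical file name)
# ... your existing recipe options stay unchanged ...
wandb_project: my_finetune_runs  # placeholder project name
```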
Then run your recipe with `fairseq2 ... --config-file <YOUR_CONFIG>.yaml`.
Alternatively, you can specify it directly on the command line with `fairseq2 ... --config wandb_project=<YOUR_PROJECT_NAME>`.
Then you can open your WandB portal and check the results in real time.
A step-by-step example
```bash
ENV_NAME=...            # YOUR_ENV_NAME
CONFIG_FILE=...         # YOUR_CONFIG_FILE
OUTPUT_DIR=...          # YOUR_OUTPUT_DIR
WANDB_PROJECT_NAME=...  # YOUR_PROJECT_NAME

conda activate $ENV_NAME

# install wandb
pip install wandb

# initialize wandb, copy-paste your token when prompted
wandb login --host=...  # your wandb hostname

# now you are good to go
fairseq2 lm instruction_finetune $OUTPUT_DIR \
    --config-file $CONFIG_FILE \
    --config wandb_project=$WANDB_PROJECT_NAME

# cleanup
conda deactivate
```