Monitor Your Experiments

What you will learn
  • How to monitor your experiments using TensorBoard

  • How to monitor your experiments using WandB

Prerequisites

TensorBoard

fairseq2 saves checkpoints and TensorBoard events to the output directory you specify ($OUTPUT_DIR), so you can inspect the details of your jobs.

# point TensorBoard at the event directory inside your output directory
tensorboard --logdir $OUTPUT_DIR/tb/train

# example
tensorboard --logdir /checkpoint/$USER/outputs/ps_llama3_1_instruct.ws_16.a73dad52/tb/train
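If you are unsure which subdirectory of your output directory holds the event files, a small helper can list them. The sketch below defines a hypothetical find_tb_logdirs function (not part of fairseq2). It relies only on TensorBoard's standard event file naming, events.out.tfevents.*, and prints each directory that contains such files exactly once.

```shell
# Hypothetical helper (not part of fairseq2): list every directory under
# the given root that contains TensorBoard event files.
# Usage: find_tb_logdirs OUTPUT_DIR
find_tb_logdirs() {
  local root="${1:-}"
  if [ -z "$root" ]; then
    echo "usage: find_tb_logdirs OUTPUT_DIR" >&2
    return 1
  fi
  # TensorBoard event files are named events.out.tfevents.*;
  # print the parent directory of each one, sorted and de-duplicated.
  find "$root" -type f -name 'events.out.tfevents.*' -exec dirname {} \; | sort -u
}
```

Pass any of the printed directories to tensorboard --logdir. Pointing --logdir at a common parent also works, because TensorBoard searches subdirectories for runs.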

If you ran your experiment on a remote server, you will likely need to forward the TensorBoard port to your local machine:

ssh -L 6006:localhost:6006 $USER@$SERVER_NAME

Then open http://localhost:6006 in your browser to view TensorBoard.
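The forwarded port must match the port TensorBoard listens on, which is 6006 by default. If that port is already in use, you can start TensorBoard on the server with its --port flag (for example tensorboard --logdir ... --port 6007) and forward the same port. The sketch below is a hypothetical tb_tunnel function (not part of fairseq2) that wraps the ssh command with a configurable port. It adds ssh's -N flag, so only the tunnel is opened and no remote command runs, and an assumed DRY_RUN variable that prints the command instead of executing it.

```shell
# Hypothetical helper (not part of fairseq2): forward a remote TensorBoard
# port to the same port on this machine.
# Usage: tb_tunnel USER@HOST [PORT]   (PORT defaults to 6006, TensorBoard's default)
# Set DRY_RUN=1 to print the ssh command instead of running it.
tb_tunnel() {
  local target="${1:-}"
  local port="${2:-6006}"
  if [ -z "$target" ]; then
    echo "usage: tb_tunnel USER@HOST [PORT]" >&2
    return 1
  fi
  # -N: do not run a remote command, just keep the tunnel open
  # -L: forward local $port to localhost:$port as seen from the server
  ${DRY_RUN:+echo} ssh -N -L "${port}:localhost:${port}" "$target"
}
```

For example, run tb_tunnel $USER@$SERVER_NAME 6007 on your local machine, then open http://localhost:6007.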

WandB

fairseq2 natively supports WandB (Weights & Biases), a tool for monitoring and managing machine learning experiments. WandB provides a centralized platform to track, compare, and analyze the performance of different models, which makes it easier to spot trends, tune hyperparameters, and reproduce results. Follow the WandB quick start guide to set it up in your environment.

To enable it, add the following line to your YAML config file:

wandb_project: <YOUR_PROJECT_NAME>

Then run your recipe with fairseq2 ... --config-file <YOUR_CONFIG>.yaml.

Alternatively, set it directly on the command line with fairseq2 ... --config wandb_project=<YOUR_PROJECT_NAME>.

You can then open the WandB portal and follow the results in real time.
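If you reuse one config file across many runs, you may want to add the wandb_project key only when it is missing. The sketch below is a hypothetical ensure_wandb_project helper (not part of fairseq2). It assumes, as in the line above, that wandb_project is a top-level key, and appends it to the YAML file unless a line starting with wandb_project: already exists. It does not parse YAML, so a key nested under another section would not be detected.

```shell
# Hypothetical helper (not part of fairseq2): make sure a YAML config file
# has a top-level `wandb_project` key, appending it if it is missing.
# Usage: ensure_wandb_project CONFIG_FILE PROJECT_NAME
ensure_wandb_project() {
  local config_file="${1:-}"
  local project="${2:-}"
  if [ -z "$config_file" ] || [ -z "$project" ]; then
    echo "usage: ensure_wandb_project CONFIG_FILE PROJECT_NAME" >&2
    return 1
  fi
  # Leave the file untouched if a top-level key is already present.
  if grep -q '^wandb_project:' "$config_file" 2>/dev/null; then
    return 0
  fi
  # If the file does not end with a newline, add one first so the new
  # key is not glued onto the last line.
  if [ -s "$config_file" ] && [ -n "$(tail -c 1 "$config_file")" ]; then
    echo >> "$config_file"
  fi
  printf 'wandb_project: %s\n' "$project" >> "$config_file"
}
```

Running ensure_wandb_project <YOUR_CONFIG>.yaml <YOUR_PROJECT_NAME> a second time leaves the file unchanged.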

A step-by-step example
ENV_NAME=...  # YOUR_ENV_NAME
CONFIG_FILE=...  # YOUR_CONFIG_FILE
OUTPUT_DIR=...  # YOUR_OUTPUT_DIR
WANDB_PROJECT_NAME=...  # YOUR_PROJECT_NAME

conda activate "$ENV_NAME"
# install wandb
pip install wandb
# log in to wandb; paste your API key when prompted
wandb login --host=...  # your wandb hostname

# now you are good to go
fairseq2 lm instruction_finetune "$OUTPUT_DIR" \
  --config-file "$CONFIG_FILE" \
  --config wandb_project="$WANDB_PROJECT_NAME"

# cleanup
conda deactivate
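The interactive wandb login step cannot work inside a batch job (for example one submitted to a cluster scheduler), because nothing can answer the prompt. The wandb client also reads the API key from the WANDB_API_KEY environment variable, so one option is to export it in the job environment. The sketch below is a hypothetical check_wandb_env pre-flight check (not part of fairseq2 or wandb) that fails fast when that variable, or the WANDB_PROJECT_NAME variable from the example above, is empty. It targets the environment-variable setup only; if you already logged in interactively, your stored credentials are used and this check is unnecessary.

```shell
# Hypothetical pre-flight check (not part of fairseq2 or wandb) for
# non-interactive jobs. Assumes credentials come from WANDB_API_KEY,
# which the wandb client reads from the environment.
check_wandb_env() {
  if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "error: WANDB_API_KEY is not set; export it or run 'wandb login' first" >&2
    return 1
  fi
  if [ -z "${WANDB_PROJECT_NAME:-}" ]; then
    echo "error: WANDB_PROJECT_NAME is not set" >&2
    return 1
  fi
  return 0
}
```

In a job script you could then launch the run only when the check passes:

check_wandb_env && fairseq2 lm instruction_finetune "$OUTPUT_DIR" --config-file "$CONFIG_FILE" --config wandb_project="$WANDB_PROJECT_NAME"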