fairseq2 saves checkpoints and TensorBoard event files to the defined $OUTPUT_DIR, so you can inspect the details of your jobs.
```bash
# run tensorboard at your checkpoint path
tensorboard --logdir $CHECKPOINT_PATH

# example
tensorboard --logdir /checkpoint/$USER/outputs/ps_llama3_1_instruct.ws_16.a73dad52/tb/train
```
If you ran your experiment on a remote server, you probably need to port-forward the TensorBoard service to your local machine:
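A minimal sketch using SSH tunneling, assuming TensorBoard runs on its default port 6006 and that $USER and $SERVER_HOSTNAME stand in for your own login and server address:

```bash
# forward the remote TensorBoard port to your local machine,
# then open http://localhost:6006 in your local browser
ssh -L 6006:localhost:6006 $USER@$SERVER_HOSTNAME
```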
fairseq2 natively supports WandB (Weights & Biases) - a powerful tool for monitoring and managing machine learning experiments.
WandB provides a centralized platform to track, compare, and analyze the performance of different models, making it easier to identify trends, optimize hyperparameters, and reproduce results.
Follow the quick start guide to initialize it in your environment.
All you need to do is add the following line to your config YAML file:

```yaml
wandb_project: <YOUR_PROJECT_NAME>
```
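For context, a sketch of where this line sits in a config file; the file name and project name below are placeholders, and the rest of the file keeps whatever recipe options you already use:

```yaml
# my_config.yaml (hypothetical file name)
# ... your existing recipe options stay unchanged ...
wandb_project: my_finetune_runs  # placeholder project name
```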
Then run your recipe with `fairseq2 ... --config-file <YOUR_CONFIG>.yaml`.
Alternatively, you can specify it directly on the command line with `fairseq2 ... --config wandb_project=<YOUR_PROJECT_NAME>`.
Then you can open your WandB portal and check the results in real time.
A step-by-step example
```bash
ENV_NAME=...            # YOUR_ENV_NAME
CONFIG_FILE=...         # YOUR_CONFIG_FILE
OUTPUT_DIR=...          # YOUR_OUTPUT_DIR
WANDB_PROJECT_NAME=...  # YOUR_PROJECT_NAME

conda activate $ENV_NAME

# install wandb
pip install wandb

# initialize wandb, copy-paste your token when prompted
wandb login --host=...  # your wandb hostname

# now you are good to go
fairseq2 lm instruction_finetune $OUTPUT_DIR \
    --config-file $CONFIG_FILE \
    --config wandb_project=$WANDB_PROJECT_NAME

# cleanup
conda deactivate
```