This tutorial explains how to debug your training sessions, including multi-node runs, using the PuDB debugger (https://github.com/inducer/pudb).
PuDB is one of several remote debuggers you can use with fairseq2.
Before setting a breakpoint, decide where in your code you want to start debugging.
Since fairseq2 supports multi-process training, ensure that the debugger is only invoked on the main process (rank 0) to prevent deadlocks.
Insert the following code where you want to set the breakpoint:
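A minimal sketch is shown below. It assumes the launcher exports a RANK environment variable (adapt the check to however your run exposes the process rank) and uses PuDB's reverse mode, so the debugger connects back to the listener you will start in the next step. Replace "meta-fairseq2" and 6899 with your own host and port.

import os

# Only rank 0 opens the debugger to avoid deadlocking the other processes.
if int(os.environ.get("RANK", "0")) == 0:
    from pudb.remote import set_trace

    # reverse=True makes the debugger connect back to the netcat listener
    # started on the host machine (see the next step).
    set_trace(host="meta-fairseq2", port=6899, reverse=True)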
On the host machine specified in the host parameter (in our case, meta-fairseq2), run the following command to start listening on the specified port:
stty -echo -icanon && nc -l -p 6899
Note
The command will appear to hang, which is expected as it’s waiting for the debugger to connect.
Ensure that the chosen port (6899 in this case) is open and accessible.
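If you are not sure whether the port is reachable from the compute nodes, a quick check (assuming netcat is available there) is:

# Run from a compute node; replace the host and port with your own.
nc -vz meta-fairseq2 6899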
In the other terminal or pane, start the fairseq2 training as usual. Here we show an example using a Slurm cluster.
Allocate Resources:
Obtain a compute allocation based on your cluster’s configuration. Here’s an example command using SLURM:
# Adjust the arguments (`--nodes`, `--ntasks-per-node`, etc.) as needed for your environment
salloc --nodes=1 --ntasks-per-node=8 --cpus-per-task=10 -t 1:00:00 --gpus-per-node=8
Start Training:
Launch your fairseq2 training job as you normally would. For example, for LLM training:
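The exact command depends on your recipe and configuration; the following is only a sketch, assuming the fairseq2 lm instruction_finetune recipe and a hypothetical $OUTPUT_DIR, run inside the Slurm allocation:

# Replace the recipe, output directory, and any --config overrides with your own.
srun fairseq2 lm instruction_finetune $OUTPUT_DIR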
Once the training reaches the breakpoint, the PuDB interface will appear in the terminal where you initialized the socket.
Example screenshot of the debugger:
Please refer to the PuDB docs and repo to explore more features and familiarize yourself with the interface.
PuDB supports all standard pdb commands in the source view and offers additional functionality for an enhanced debugging experience.