This tutorial explains how to debug your training sessions, including multi-node runs, using the PuDB debugger (https://github.com/inducer/pudb).
PuDB is one of several remote debuggers you can use with fairseq2.
Before setting a breakpoint, decide where in your code you want to start debugging.
Since fairseq2 supports multi-process training, ensure that the debugger is only invoked on the main process (rank 0) to prevent deadlocks.
Insert the following code where you want to set the breakpoint:
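A minimal sketch is shown below. It assumes the launcher exports a RANK environment variable (adapt the check to however your run exposes the process rank) and uses PuDB's reverse mode, so the debugger connects back to the listener you will start in the next step. Replace "meta-fairseq2" and 6899 with your own host and port.

import os

# Only rank 0 opens the debugger to avoid deadlocking the other processes.
if int(os.environ.get("RANK", "0")) == 0:
    from pudb.remote import set_trace

    # reverse=True makes the debugger connect back to the netcat listener
    # started on the host machine (see the next step).
    set_trace(host="meta-fairseq2", port=6899, reverse=True)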
On the host machine specified in the host parameter (in our case, meta-fairseq2), run the following command to start listening on the specified port:
stty -echo -icanon && nc -l -p 6899
Note
The command will appear to hang, which is expected as it’s waiting for the debugger to connect.
Ensure that the chosen port (6899 in this case) is open and accessible.
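If you are not sure whether the port is reachable from the compute nodes, a quick check (assuming netcat is available there) is:

# Run from a compute node; replace the host and port with your own.
nc -vz meta-fairseq2 6899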
In the other terminal or pane, start the fairseq2 training as usual. Here we show an example using a Slurm cluster.
Allocate Resources:
Obtain a compute allocation based on your cluster’s configuration. Here’s an example command using SLURM:
# Adjust the arguments (`--nodes`, `--ntasks-per-node`, etc.) as needed for your environment
salloc --nodes=1 --ntasks-per-node=8 --cpus-per-task=10 -t 1:00:00 --gpus-per-node=8
Start Training:
Launch your fairseq2 training job as you normally would. For example, for LLM training:
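The exact command depends on your recipe and configuration; the following is only a sketch, assuming the fairseq2 lm instruction_finetune recipe and a hypothetical $OUTPUT_DIR, run inside the Slurm allocation:

# Replace the recipe, output directory, and any --config overrides with your own.
srun fairseq2 lm instruction_finetune $OUTPUT_DIR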
Once the training reaches the breakpoint, the PuDB interface will appear in the terminal where you initialized the socket.
Example screenshot of the debugger:
Please refer to the PuDB docs and repo to explore more features and familiarize yourself with the interface.
PuDB supports all standard pdb commands in the source view and offers additional functionality for an enhanced debugging experience.