Debugging with PuDB

What you will learn
  • How to debug multi-node training using PuDB

Prerequisites
  • Get familiar with fairseq2 basics ( Overview)

  • Ensure you have fairseq2 installed ( Installation)

  • Understand how to use CLI (CLI)

  • Install PuDB: pip install pudb

This tutorial explains how to debug your training sessions, including multi-node runs, using the PuDB debugger <https://github.com/inducer/pudb>. PuDB is one of several remote debuggers you can use with fairseq2.

Placing the debugger breakpoint in the code

Before setting a breakpoint, decide where in your code you want to start debugging. Since fairseq2 supports multi-process training, ensure that the debugger is only invoked on the main process (rank 0) to prevent deadlocks.

Insert the following code where you want to set the breakpoint:

from fairseq2.gang import get_rank
if get_rank() == 0:
    from pudb.remote import set_trace

set_trace(host="meta-fairseq2", port=6899, term_size=(80*3, 24*3), reverse=True)

Explanation:

  • host="meta-fairseq2": Replace with the hostname accessible to both the machine running fairseq2 and your local machine.

  • port=6899: Choose an appropriate port that is open and not in use.

  • term_size=(80*3, 24*3): Sets the terminal size for the debugger interface.

  • reverse=True: Instructs the debugger to initiate the connection from the host.

Initializing the socket for remote debugger

On the host machine specified in the host parameter (e.g., in our case it’s meta-fairseq2), run the following command to start listening on the specified port:

stty -echo -icanon && nc -l -p 6899

Note

  • The command will appear to hang, which is expected as it’s waiting for the debugger to connect.

  • Ensure that the chosen port (6899 in this case) is open and accessible.

Running fairseq2 with debugger

In the other terminal / pane you need to start the fairseq2 training as usual. Here we show an example using slurm cluster.

  1. Allocate Resources:

    Obtain a compute allocation based on your cluster’s configuration. Here’s an example command using SLURM:

    # Adjust the arguments (`--nodes`, `--ntasks-per-node`, etc.) as needed for your environment
    salloc --nodes=1 --ntasks-per-node=8 --cpus-per-task=10 -t 1:00:00 --gpus-per-node=8
    
  2. Start Training:

    Launch your fairseq2 training job as you normally would. For example, for LLM training:

    srun fairseq2 lm preference_finetune_w_eval $OUTPUT_DIR --no-sweep-dir --config-file $CONFIG_YAML
    
  3. Connect to the Debugger:

    Once the training reaches the breakpoint, the PuDB interface will appear in the terminal where you initialized the socket.

Example screenshot of the debugger:

PuDB example

Please refer to the PuDB docs and repo to explore more features and familiarize yourself with the interface. PuDB supports all standard pdb commands in the source view and offers additional functionality for an enhanced debugging experience.

Exiting the debugger

Press q to quit the debugger. This will terminate the socket session and stop the training job.