check-nccl

Overview

Validates NCCL (NVIDIA Collective Communications Library) performance and correctness by running distributed GPU communication tests. Supports single-node and pairwise multi-node testing using MPI to orchestrate collective operations (all_reduce, all_gather, alltoall). Measures average bus bandwidth and compares against configurable thresholds to detect network degradation or GPU interconnect issues.
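
The threshold logic described above reduces to parsing the average bus bandwidth from the nccl-tests output and comparing it against the warn and critical thresholds. The following is a minimal sketch, assuming the `# Avg bus bandwidth` summary line printed by nccl-tests; the actual parsing inside check-nccl may differ.

# Hedged sketch: extract the average bus bandwidth from nccl-tests output
# and map it to the OK/WARN/CRITICAL exit codes (0/1/2) described below.
import re

OK, WARN, CRITICAL = 0, 1, 2

def classify_bandwidth(test_output: str, critical: float, warn: float | None) -> int:
    """Return an exit code based on the nccl-tests summary line (values in GB/s)."""
    match = re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", test_output)
    if match is None:
        return WARN  # bandwidth parsing failed
    bw = float(match.group(1))
    if bw < critical:
        return CRITICAL
    if warn is not None and bw < warn:
        return WARN
    return OK

# Example: 230 GB/s with --critical-threshold 200 --warn-threshold 250 -> WARN
print(classify_bandwidth("# Avg bus bandwidth    : 230.12", 200.0, 250.0))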

Requirements

  • NVIDIA GPUs on all tested nodes
  • NVIDIA driver and CUDA toolkit
  • MPI implementation (OpenMPI)
  • Network fabric configured (InfiniBand, RoCE, or TCP/IP)
  • Passwordless SSH between nodes (for MPI)
  • NCCL library installed

Required Binaries

Located in --nccl-tdir:

  • all_reduce_perf - For --op all_reduce
  • all_gather_perf - For --op all_gather
  • alltoall_perf - For --op alltoall

Installation:

# Clone and build the NCCL tests with MPI support
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda/12.0 NCCL_HOME=/path/to/NCCL/2.21.5-1
# Binaries are placed in ./build/
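
For reference, the op-to-binary mapping listed above amounts to a simple lookup inside --nccl-tdir. The sketch below is illustrative only; the helper name and error handling are assumptions, not the actual check-nccl implementation.

# Hedged sketch: resolve the test binary for each --op inside --nccl-tdir.
from pathlib import Path

OP_TO_BINARY = {
    "all_reduce": "all_reduce_perf",
    "all_gather": "all_gather_perf",
    "alltoall": "alltoall_perf",
}

def resolve_test_binary(nccl_tdir: str, op: str) -> Path:
    binary = Path(nccl_tdir) / OP_TO_BINARY[op]
    if not binary.is_file():
        raise FileNotFoundError(f"Missing NCCL test binary: {binary}")
    return binary

# Example: resolve_test_binary("/opt/nccl-tests/build", "all_reduce")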

MPI Requirements

  • mpirun or a compatible launcher available in PATH
  • MPI must be able to discover and connect to every host in the hostlist
  • Sufficient process slots per node (typically equal to the GPU count)

Verification:

# Test MPI connectivity
mpirun --host node1:8,node2:8 --np 16 hostname

# Check available slots
mpirun --host node1 --np 1 cat /proc/cpuinfo | grep processor | wc -l

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --single | Flag | True | Single-node NCCL testing (default) |
| --pairwise / --pairwise-exhaustive | Flag | False | Test all possible node pairs from the hostlist |
| --pairwise-quick | Flag | False | Test each node once (pairs: even-odd indices) |
| --mpi-binpath | Path | mpirun | Path to the mpirun binary |
| --mpi-opts | String | -mca coll_hcoll_enable 0 --bind-to numa | Options passed to mpirun |
| --np / -n | Integer | Auto-calculated | Number of MPI processes (defaults to nodes × GPUs per node) |
| --gpus-per-node | Integer | 8 | GPUs per node (used to auto-calculate -np) |
| --hostlist | String | localhost | Node list (required for pairwise modes) |
| --export / -x | String (multiple) | NCCL_IB_PCI_RELAXED_ORDERING=1, CUDA_DEVICE_ORDER=PCI_BUS_ID, NCCL_SOCKET_IFNAME=eth0, NCCL_DEBUG=WARN | Environment variables to export to MPI processes |
| --nccl-tdir | Path | Required | Directory containing the NCCL test binaries |
| --nccl-topts | String | -g 1 -b 32M -e 1G -f 2 | NCCL test options (see the NCCL tests docs) |
| --op / -p | Choice (multiple) | Required | Operations: all_gather, all_reduce, alltoall |
| --nvlink / --no-nvlink | Flag | --no-nvlink | Enable/disable NVLink (disables P2P and SHM when off) |
| --critical-threshold | Float | Required | CRITICAL exit if avg bus bandwidth < threshold (GB/s) |
| --warn-threshold | Float | None | WARN exit if avg bus bandwidth < threshold (GB/s) |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Hostlist Parsing

Uses SLURM-style nodelist expansion:

node[1-3]   → node1, node2, node3
gpu[01-04]  → gpu01, gpu02, gpu03, gpu04
host1,host2 → host1, host2
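
This expansion can be reproduced with a small parser. Below is a minimal sketch covering only the simple forms shown above; the function name is illustrative, and production code would typically rely on an existing nodelist library (for example, ClusterShell's NodeSet) rather than this hand-rolled version.

# Hedged sketch of SLURM-style nodelist expansion for the forms listed above.
import re

def expand_hostlist(hostlist: str) -> list[str]:
    hosts = []
    for item in hostlist.split(","):
        m = re.fullmatch(r"(.+)\[(\d+)-(\d+)\]", item.strip())
        if m:
            prefix, start, end = m.group(1), m.group(2), m.group(3)
            width = len(start)  # preserve zero padding, e.g. gpu[01-04]
            hosts += [f"{prefix}{i:0{width}d}" for i in range(int(start), int(end) + 1)]
        else:
            hosts.append(item.strip())
    return hosts

print(expand_hostlist("node[1-3]"))    # ['node1', 'node2', 'node3']
print(expand_hostlist("gpu[01-04]"))   # ['gpu01', 'gpu02', 'gpu03', 'gpu04']
print(expand_hostlist("host1,host2"))  # ['host1', 'host2']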

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All tests passed thresholds |
| WARN (1) | Test execution failed |
| WARN (1) | Bandwidth parsing failed |
| WARN (1) | Below warn threshold |
| CRITICAL (2) | Below critical threshold |

Aggregation: if any test returns CRITICAL, the overall exit is CRITICAL; otherwise, if any test returns WARN, the overall exit is WARN; otherwise OK.
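
Because the exit codes are ordered (0 < 1 < 2), this aggregation rule reduces to taking the maximum severity across all (node pair, operation) results. A minimal sketch of that logic:

# Hedged sketch of the aggregation rule described above.
OK, WARN, CRITICAL = 0, 1, 2

def aggregate(results: list[int]) -> int:
    # CRITICAL dominates WARN, WARN dominates OK; with codes 0/1/2 this is max().
    return max(results, default=OK)

print(aggregate([OK, WARN, OK]))        # 1 (WARN)
print(aggregate([OK, WARN, CRITICAL]))  # 2 (CRITICAL)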

Usage Examples

Single-Node NVLink Test

health_checks check-nccl \
--single \
--nccl-tdir /opt/nccl-tests/build \
--op all_reduce \
--nvlink \
--critical-threshold 200 \
--warn-threshold 250 \
--sink stdout \
[CLUSTER] \
app

Tests intra-node NVLink bandwidth: ≥250 GB/s is OK, 200-250 GB/s is WARN, and <200 GB/s is CRITICAL.

Pairwise Network Test (All Pairs)

health_checks check-nccl \
--pairwise \
--hostlist node[1-8] \
--nccl-tdir /opt/nccl-tests/build \
--op all_reduce \
--op all_gather \
--no-nvlink \
--critical-threshold 50 \
--warn-threshold 80 \
--gpus-per-node 8 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

Tests all 28 node pairs (C(8,2)) with 2 operations each = 56 total tests.

Pairwise Quick Test

health_checks check-nccl \
--pairwise-quick \
--hostlist node[1-16] \
--nccl-tdir /opt/nccl-tests/build \
--op all_reduce \
--critical-threshold 40 \
--timeout 600 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

Tests 8 node pairs (even-odd pairing), faster than exhaustive.
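
The pair counts for the two pairwise modes follow directly from their pairing rules. The sketch below shows what --pairwise (all combinations) and --pairwise-quick (even-odd index pairing) could generate for a 16-node hostlist; the exact pairing order used by check-nccl is an assumption here.

# Hedged sketch of pair generation for the two pairwise modes.
from itertools import combinations

nodes = [f"node{i}" for i in range(1, 17)]  # node[1-16]

exhaustive = list(combinations(nodes, 2))    # all pairs: C(16,2) = 120
quick = list(zip(nodes[0::2], nodes[1::2]))  # even-odd index pairing: 8 pairs

print(len(exhaustive), len(quick))  # 120 8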

Custom MPI Configuration

health_checks check-nccl \
--single \
--nccl-tdir /opt/nccl-tests/build \
--op alltoall \
--critical-threshold 100 \
--mpi-binpath /usr/local/bin/mpirun \
--mpi-opts "-mca btl tcp,self -mca btl_tcp_if_include eth0" \
--export NCCL_DEBUG=INFO \
--export NCCL_IB_HCA=mlx5 \
--nccl-topts "-g 1 -b 8M -e 2G -f 2 -n 100" \
--sink stdout \
[CLUSTER] \
app

Multi-Operation Test with Custom Thresholds

health_checks check-nccl \
--pairwise-quick \
--hostlist gpu[01-32] \
--nccl-tdir /opt/nccl-tests/build \
--op all_reduce \
--op all_gather \
--op alltoall \
--nvlink \
--critical-threshold 150 \
--warn-threshold 200 \
--gpus-per-node 8 \
--np 128 \
--timeout 900 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

Debug Mode with Verbose Output

health_checks check-nccl \
--single \
--nccl-tdir /opt/nccl-tests/build \
--op all_reduce \
--critical-threshold 50 \
--log-level DEBUG \
--verbose-out \
--export NCCL_DEBUG=INFO \
--sink stdout \
[CLUSTER] \
app