nvlink

Overview

Monitors NVLink errors and status using NVIDIA Data Center GPU Manager (DCGM). Detects CRC errors, replay/recovery errors, and link connectivity issues across GPUs.

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --check / -c | Multiple | - | Checks to perform: nvlink_errors, nvlink_status (required, can specify both) |
| --gpu_num / -g | Integer | 8 | Number of GPUs to check |
| --data_error_threshold | Integer | 0 | CRC Data Error threshold (errors to tolerate) |
| --flit_error_threshold | Integer | 0 | CRC FLIT Error threshold |
| --recovery_error_threshold | Integer | 0 | Recovery Error threshold |
| --replay_error_threshold | Integer | 0 | Replay Error threshold |
| --host | String | localhost | DCGM host endpoint to connect to |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Check Types

nvlink_errors

Monitors NVLink error counts per GPU link:

  • CRC Data Error - Data corruption errors
  • CRC FLIT Error - Flow control unit errors
  • Replay Error - Link replay requests
  • Recovery Error - Link recovery events

For each GPU (0 to gpu_num - 1), the check reads every link and compares each error count against the matching threshold. A count above its threshold is reported as CRITICAL.
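The per-link comparison can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the nested `counts` dictionary, the `Violation` record, and the `find_violations` helper are hypothetical names, and the sketch assumes the DCGM output has already been parsed into per-GPU, per-link counters. A threshold is treated as the number of errors to tolerate, so only counts strictly above it are flagged.

```python
from dataclasses import dataclass

# Hypothetical short names for the four counters described above.
ERROR_TYPES = ("data", "flit", "recovery", "replay")


@dataclass(frozen=True)
class Violation:
    gpu: int
    link: int
    error_type: str
    count: int
    threshold: int


def find_violations(counts, thresholds, gpu_num=8):
    """Return every counter that exceeds its threshold.

    counts:     {gpu_index: {link_index: {error_type: count}}} (assumed shape)
    thresholds: {error_type: tolerated_errors}; missing entries default to 0
    """
    violations = []
    for gpu in range(gpu_num):  # GPUs 0 .. gpu_num - 1
        for link, per_type in sorted(counts.get(gpu, {}).items()):
            for error_type in ERROR_TYPES:
                count = per_type.get(error_type, 0)
                limit = thresholds.get(error_type, 0)
                if count > limit:  # tolerated up to and including the limit
                    violations.append(
                        Violation(gpu, link, error_type, count, limit)
                    )
    return violations


# Example with made-up counters for two GPUs.
counts = {
    0: {0: {"data": 3}, 1: {"replay": 101}},
    1: {0: {"flit": 5}},
}
thresholds = {"data": 10, "flit": 5, "replay": 100, "recovery": 50}
violations = find_violations(counts, thresholds, gpu_num=2)
for v in violations:
    print(f"GPU {v.gpu} link {v.link}: {v.error_type}={v.count} > {v.threshold}")
```

With these sample values only the replay counter on GPU 0, link 1 is above its limit; the FLIT count of 5 equals its threshold and is tolerated.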

nvlink_status

Monitors NVLink link status for all GPUs:

  • U (Up) - Link is operational
  • D (Down) - Link is down
  • X (Disabled) - Link is disabled

Detects any down or disabled links across all GPUs.
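A status scan along these lines could look like the sketch below. It assumes the link states for each GPU are already available as a whitespace-separated string of U, D, and X characters. That input shape and the `find_bad_links` helper are illustrative assumptions, not the tool's real parser.

```python
# States that should fail the check, per the list above.
BAD_STATES = {"D": "down", "X": "disabled"}


def find_bad_links(status_by_gpu):
    """Return (gpu, link_index, state) for every down or disabled link.

    status_by_gpu: {gpu_index: "U U D X ..."} (assumed, pre-parsed input)
    """
    bad = []
    for gpu, line in sorted(status_by_gpu.items()):
        for link, state in enumerate(line.split()):
            if state in BAD_STATES:
                bad.append((gpu, link, state))
    return bad


# Example with made-up status lines for two GPUs.
example = {0: "U U U U", 1: "U D U X"}
bad_links = find_bad_links(example)
for gpu, link, state in bad_links:
    print(f"GPU {gpu} link {link} is {BAD_STATES[state]}")
```

Any non-empty result would correspond to the "links are down or disabled" condition.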

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All NVLink checks passed |
| WARN (1) | DCGM command failed to execute |
| WARN (1) | Output parsing failed |
| CRITICAL (2) | NVLink errors exceed threshold |
| CRITICAL (2) | NVLink links are down or disabled |
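The mapping from conditions to exit codes can be summarized in a short Python sketch. The `ExitCode` enum and `resolve_exit_code` function are hypothetical names. The order of precedence (killswitch first, then execution or parsing failures, then check results) is an assumption made for illustration, since the table does not say how overlapping conditions are resolved.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def resolve_exit_code(
    killswitch_active=False,
    command_failed=False,
    parse_failed=False,
    errors_over_threshold=False,
    links_down_or_disabled=False,
):
    """Map check outcomes to an exit code (precedence is an assumption)."""
    if killswitch_active:
        return ExitCode.OK  # feature flag disabled: nothing is checked
    if command_failed or parse_failed:
        return ExitCode.WARN  # no trustworthy data to evaluate
    if errors_over_threshold or links_down_or_disabled:
        return ExitCode.CRITICAL
    return ExitCode.OK  # all NVLink checks passed


print(int(resolve_exit_code(links_down_or_disabled=True)))
```

A wrapper script could pass the returned value to `sys.exit` so that schedulers or monitoring agents see the same numeric codes as in the table.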

Usage Examples

Error Check

health_checks check-dcgmi nvlink \
--check nvlink_errors \
--gpu_num 8 \
[CLUSTER] \
app

Status Check

health_checks check-dcgmi nvlink \
--check nvlink_status \
--gpu_num 8 \
[CLUSTER] \
prolog

Combined Checks

health_checks check-dcgmi nvlink \
--check nvlink_errors \
--check nvlink_status \
--gpu_num 8 \
[CLUSTER] \
epilog

With Error Thresholds

health_checks check-dcgmi nvlink \
--check nvlink_errors \
--gpu_num 8 \
--data_error_threshold 10 \
--flit_error_threshold 5 \
--replay_error_threshold 100 \
--recovery_error_threshold 50 \
[CLUSTER] \
app

Debug Mode

health_checks check-dcgmi nvlink \
--check nvlink_errors \
--gpu_num 8 \
--log-level DEBUG \
--verbose-out \
[CLUSTER] \
app