check-dcgmi

GPU diagnostics and NVLink validation using NVIDIA Data Center GPU Manager (DCGM).

Subcommands

| Subcommand | Purpose | Key Feature |
|---|---|---|
| diag | Hardware diagnostics across multiple test levels | Deployment, integration, hardware, and stress testing with category exclusion |
| nvlink | NVLink error and status monitoring | Error threshold validation and link status detection |

Quick Start

Run GPU Diagnostics

health_checks check-dcgmi diag \
  --diag_level 1 \
  [CLUSTER] \
  app

Check NVLink Errors

health_checks check-dcgmi nvlink \
  --check nvlink_errors \
  --gpu_num 8 \
  [CLUSTER] \
  app

Check NVLink Status

health_checks check-dcgmi nvlink \
  --check nvlink_status \
  --gpu_num 8 \
  [CLUSTER] \
  app
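
The three Quick Start commands can be chained in one small wrapper. The following is a minimal sketch, not part of the tool itself. It assumes that `[CLUSTER]` stands for a cluster name (the value `my-cluster` is hypothetical) and that `health_checks` exits non-zero when a check fails; neither is confirmed by the text above. The helper names `run` and `run_all_checks` and the `DRY_RUN` variable are made up for illustration. By default the script only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Sketch: run the diag check and both nvlink checks in sequence.
# Flags and argument order are copied from the Quick Start commands.

CLUSTER="${CLUSTER:-my-cluster}"  # hypothetical cluster name (assumption)
GPU_NUM="${GPU_NUM:-8}"           # GPU count used in the Quick Start examples
DRY_RUN="${DRY_RUN:-1}"           # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Run all three checks; return 1 if any command exits non-zero.
# Treating a non-zero exit status as a failed check is an assumption.
run_all_checks() {
  local failed=0
  local check
  run health_checks check-dcgmi diag --diag_level 1 "$CLUSTER" app || failed=1
  for check in nvlink_errors nvlink_status; do
    run health_checks check-dcgmi nvlink \
      --check "$check" \
      --gpu_num "$GPU_NUM" \
      "$CLUSTER" app || failed=1
  done
  return "$failed"
}

run_all_checks
```

Setting `DRY_RUN=0` executes the commands instead of printing them. The wrapper keeps going after a failed check so that one run reports on all three.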