diag

Overview

Performs comprehensive GPU diagnostics using NVIDIA Data Center GPU Manager (DCGM). Validates GPU health across deployment, integration, hardware, and stress test categories.

Command-Line Options

Option	Type	Default	Description
`--diag_level`	Integer (1-4)	1	Diagnostic depth: 1=quick, 2=medium, 3=long, 4=extended
`--exclude_category` / `-x`	Multiple	[]	Skip specific test categories (can specify multiple)
`--host`	String	localhost	DCGM host endpoint to connect to
`--timeout`	Integer	300	Command execution timeout in seconds
`--sink`	String	do_nothing	Telemetry sink destination
`--sink-opts`	Multiple	-	Sink-specific configuration
`--verbose-out`	Flag	False	Display detailed output
`--log-level`	Choice	INFO	DEBUG, INFO, WARNING, ERROR, CRITICAL
`--log-folder`	String	`/var/log/fb-monitoring`	Log directory
`--heterogeneous-cluster-v1`	Flag	False	Enable heterogeneous cluster support

Exit Conditions

Exit Code	Condition
OK (0)	Feature flag disabled (killswitch active)
OK (0)	All diagnostic tests passed
WARN (1)	DCGM command failed to execute
WARN (1)	Test returned warning status
WARN (1)	Output parsing failed
CRITICAL (2)	One or more diagnostic tests failed

Usage Examples

Basic Diagnostic

health_checks check-dcgmi diag [CLUSTER] app

Exclude Specific Categories

health_checks check-dcgmi diag \
  --exclude_category "Graphics Processes" \
  --exclude_category "Persistence Mode" \
  [CLUSTER] \
  app

Remote DCGM Host

health_checks check-dcgmi diag \
  --host dcgm-server.example.com \
  --timeout 180 \
  [CLUSTER] \
  app

Debug Mode

health_checks check-dcgmi diag \
  --diag_level 1 \
  --log-level DEBUG \
  --verbose-out \
  [CLUSTER] \
  app

Overview​

Command-Line Options​

Exit Conditions​

Usage Examples​

Basic Diagnostic​

Exclude Specific Categories​

Remote DCGM Host​

Debug Mode​

Overview

Command-Line Options

Exit Conditions

Usage Examples

Basic Diagnostic

Exclude Specific Categories

Remote DCGM Host

Debug Mode