Skip to main content

diag

Overview

Performs comprehensive GPU diagnostics using NVIDIA Data Center GPU Manager (DCGM). Validates GPU health across deployment, integration, hardware, and stress test categories.

Command-Line Options

OptionTypeDefaultDescription
--diag_levelInteger (1-4)1Diagnostic depth: 1=quick, 2=medium, 3=long, 4=extended
--exclude_category / -xMultiple[]Skip specific test categories (can specify multiple)
--hostStringlocalhostDCGM host endpoint to connect to
--timeoutInteger300Command execution timeout in seconds
--sinkStringdo_nothingTelemetry sink destination
--sink-optsMultiple-Sink-specific configuration
--verbose-outFlagFalseDisplay detailed output
--log-levelChoiceINFODEBUG, INFO, WARNING, ERROR, CRITICAL
--log-folderString/var/log/fb-monitoringLog directory
--heterogeneous-cluster-v1FlagFalseEnable heterogeneous cluster support

Exit Conditions

Exit CodeCondition
OK (0)Feature flag disabled (killswitch active)
OK (0)All diagnostic tests passed
WARN (1)DCGM command failed to execute
WARN (1)Test returned warning status
WARN (1)Output parsing failed
CRITICAL (2)One or more diagnostic tests failed

Usage Examples

Basic Diagnostic

health_checks check-dcgmi diag [CLUSTER] app

Exclude Specific Categories

health_checks check-dcgmi diag \
--exclude_category "Graphics Processes" \
--exclude_category "Persistence Mode" \
[CLUSTER] \
app

Remote DCGM Host

health_checks check-dcgmi diag \
--host dcgm-server.example.com \
--timeout 180 \
[CLUSTER] \
app

Debug Mode

health_checks check-dcgmi diag \
--diag_level 1 \
--log-level DEBUG \
--verbose-out \
[CLUSTER] \
app