check-nvidia-smi

Validates GPU health through NVIDIA's NVML (NVIDIA Management Library) API, covering GPU hardware state, memory integrity, thermal status, and process occupancy.

Available Health Checks

| Check | Purpose | Key Feature |
| --- | --- | --- |
| clock_freq | Clock frequency validation | Ensure GPU/memory clocks meet minimums |
| ecc_corrected_volatile_total | Corrected ECC errors | Monitor corrected error accumulation |
| ecc_uncorrected_volatile_total | Uncorrected ECC errors | Validate uncorrected error counts |
| gpu_mem_usage | Memory usage check | Verify GPU memory usage below limit |
| gpu_num | GPU count validation | Verify expected number of GPUs detected |
| gpu_retired_pages | Retired pages tracking | Monitor ECC-retired memory pages |
| gpu_temperature | Thermal monitoring | Validate GPU temperatures below threshold |
| row_remap_failed | Failed row remaps | Detect failed row remap operations |
| row_remap_pending | Pending row remaps | Ensure no pending row remaps |
| row_remap | Row remapping status | Check for pending/failed row remaps |
| running_procs_and_kill | Process cleanup | Retry logic with optional force-kill capability |
| running_procs | Process occupancy check | Detect processes using GPUs |
| vbios_mismatch | VBIOS consistency | Verify consistent VBIOS across GPUs |
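
To make the kind of validation listed above more concrete, here is a minimal Python sketch of how a GPU count check and a temperature check could be evaluated against NVML readings. It is not the tool's actual implementation: the `GpuReading` type and the function names are hypothetical, and treating only temperatures strictly above the threshold as failures is an assumption. The `read_gpus` helper uses the `pynvml` bindings (from the `nvidia-ml-py` package) and needs an NVIDIA driver; the two check functions are pure and work on any list of readings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuReading:
    """Hypothetical per-GPU snapshot; the field names are illustrative only."""
    index: int
    temperature_c: int


def check_gpu_num(readings: List[GpuReading], expected: int) -> bool:
    """Pass when the number of detected GPUs equals the expected count."""
    return len(readings) == expected


def check_gpu_temperature(readings: List[GpuReading], threshold_c: int) -> List[int]:
    """Return indices of GPUs hotter than the threshold.

    Assumption: a reading equal to the threshold still passes.
    """
    return [r.index for r in readings if r.temperature_c > threshold_c]


def read_gpus() -> List[GpuReading]:
    """Collect one reading per GPU through NVML (requires an NVIDIA driver)."""
    import pynvml  # imported lazily so the pure checks work without a GPU

    pynvml.nvmlInit()
    try:
        readings = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            readings.append(GpuReading(index=i, temperature_c=temp))
        return readings
    finally:
        # Always release NVML, even if a query above raised.
        pynvml.nvmlShutdown()
```

On a GPU host, `check_gpu_num(read_gpus(), 8)` would correspond roughly to the `gpu_num` check with an expected count of 8, and `check_gpu_temperature(read_gpus(), 85)` to the `gpu_temperature` check with a threshold of 85.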

Quick Start

# GPU count check
health_checks check-nvidia-smi --check gpu_num --gpu_num 8 [CLUSTER] app

# Multiple checks
health_checks check-nvidia-smi --check gpu_num --check clock_freq --check running_procs [CLUSTER] app

# Temperature validation
health_checks check-nvidia-smi --check gpu_temperature --gpu_temperature_threshold 85 [CLUSTER] app
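
When these commands are generated from a script, it can help to assemble the argument list programmatically. The sketch below is only an illustration: `build_command` is a hypothetical helper, it uses only the flags that appear in the examples above (`--check`, `--gpu_num`, `--gpu_temperature_threshold`), and it copies the trailing `[CLUSTER] app` arguments verbatim from those examples. The cluster name `mycluster` is a placeholder.

```python
import shlex
from typing import Dict, List, Optional


def build_command(
    cluster: str,
    checks: List[str],
    options: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build an argv list mirroring the Quick Start command lines."""
    argv = ["health_checks", "check-nvidia-smi"]
    # Each selected check is passed as its own --check flag.
    for check in checks:
        argv += ["--check", check]
    # Per-check settings, e.g. {"gpu_num": "8"} becomes "--gpu_num 8".
    for flag, value in (options or {}).items():
        argv += [f"--{flag}", str(value)]
    # Trailing arguments copied from the examples: the cluster name, then "app".
    argv += [cluster, "app"]
    return argv


if __name__ == "__main__":
    cmd = build_command(
        "mycluster",
        ["gpu_temperature"],
        {"gpu_temperature_threshold": "85"},
    )
    # Print a shell-quoted form; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.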