check-nvidia-smi
Validates GPU health through NVIDIA's NVML (NVIDIA Management Library) API, covering hardware state, memory integrity, thermal status, and process occupancy.
Available Health Checks
| Check | Purpose | Key Feature |
|---|---|---|
| clock_freq | Clock frequency validation | Ensure GPU/memory clocks meet minimums |
| ecc_corrected_volatile_total | Corrected ECC errors | Monitor corrected error accumulation |
| ecc_uncorrected_volatile_total | Uncorrected ECC errors | Validate uncorrected error counts |
| gpu_mem_usage | Memory usage check | Verify GPU memory usage below limit |
| gpu_num | GPU count validation | Verify expected number of GPUs detected |
| gpu_retired_pages | Retired pages tracking | Monitor ECC-retired memory pages |
| gpu_temperature | Thermal monitoring | Validate GPU temperatures below threshold |
| row_remap_failed | Failed row remaps | Detect failed row remap operations |
| row_remap_pending | Pending row remaps | Ensure no pending row remaps |
| row_remap | Row remapping status | Check for pending/failed row remaps |
| running_procs_and_kill | Process cleanup | Retry logic with optional force-kill capability |
| running_procs | Process occupancy check | Detect processes using GPUs |
| vbios_mismatch | VBIOS consistency | Verify consistent VBIOS across GPUs |
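
As a rough illustration of what threshold-style checks such as `gpu_temperature` and `gpu_num` evaluate, the sketch below separates the NVML query from the comparison logic. It is not the tool's actual implementation: the function names are hypothetical, the strict `>` comparison is an assumption, and `read_temperatures` assumes the `pynvml` bindings and an NVIDIA driver are installed.

```python
from typing import Dict, List


def gpus_over_temperature(temps_c: Dict[int, int], threshold_c: int) -> List[int]:
    """Return indices of GPUs whose temperature exceeds threshold_c.

    Assumption: a reading equal to the threshold still passes.
    """
    return sorted(idx for idx, temp in temps_c.items() if temp > threshold_c)


def gpu_count_matches(temps_c: Dict[int, int], expected: int) -> bool:
    """Return True when the number of detected GPUs equals the expected count."""
    return len(temps_c) == expected


def read_temperatures() -> Dict[int, int]:
    """Query per-GPU core temperature (degrees Celsius) through NVML."""
    # Imported lazily so the pure functions above work on hosts without a GPU.
    import pynvml

    pynvml.nvmlInit()
    try:
        temps = {}
        for idx in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            temps[idx] = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
        return temps
    finally:
        # Always release NVML, even if a query raised.
        pynvml.nvmlShutdown()
```

On a GPU host, `gpus_over_temperature(read_temperatures(), 85)` would list the GPUs running hotter than 85 °C; with sample data, `gpus_over_temperature({0: 71, 1: 88, 2: 64, 3: 90}, 85)` returns `[1, 3]`.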
Quick Start
```shell
# GPU count check
health_checks check-nvidia-smi --check gpu_num --gpu_num 8 [CLUSTER] app

# Multiple checks
health_checks check-nvidia-smi --check gpu_num --check clock_freq --check running_procs [CLUSTER] app

# Temperature validation
health_checks check-nvidia-smi --check gpu_temperature --gpu_temperature_threshold 85 [CLUSTER] app
```
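
When these invocations are generated from a script, the repeated `--check` flag can be assembled programmatically. The helper below is a hypothetical sketch, not part of the tool: `build_command` is an illustrative name, and it uses only the flags shown above, keeping the trailing cluster name and `app` arguments in the same positions as the examples.

```python
import shlex
from typing import Dict, List, Optional


def build_command(
    checks: List[str],
    cluster: str,
    options: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Assemble an argv list mirroring the Quick Start invocations."""
    argv = ["health_checks", "check-nvidia-smi"]
    # One --check flag per requested check, in the order given.
    for check in checks:
        argv += ["--check", check]
    # Extra options such as gpu_num or gpu_temperature_threshold.
    for flag, value in (options or {}).items():
        argv += [f"--{flag}", str(value)]
    # Trailing positional arguments, copied from the examples above.
    argv += [cluster, "app"]
    return argv


def as_shell(argv: List[str]) -> str:
    """Render the argv list as a single shell-quoted command line."""
    return shlex.join(argv)
```

For example, `as_shell(build_command(["gpu_num"], "my_cluster", {"gpu_num": 8}))` yields `health_checks check-nvidia-smi --check gpu_num --gpu_num 8 my_cluster app`, where `my_cluster` stands in for the `[CLUSTER]` placeholder.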