
Health Checks Documentation

Comprehensive documentation for all GCM health checks. Health checks validate system health, hardware functionality, and configuration correctness across compute nodes.

Overview

Health checks are grouped into nine categories: Storage, Security, GPU, Network, Hardware, System, Process, CPU, and Utility.

Quick Reference

| Check | Category | Purpose |
|-------|----------|---------|
| check-airstore | Storage | Validates Flash Array credential configuration |
| check-authentication | Security | Verifies password status and file access permissions |
| check-blockdev | Storage | Monitors NVMe device health via SMART data |
| check-dcgmi | GPU | NVIDIA DCGM diagnostics and NVLink validation |
| check-ethlink | Network | Validates Ethernet interface configuration and state |
| check-hca | Network | Verifies InfiniBand HCA device count |
| check-ib | Network | InfiniBand link health and performance validation |
| check-ipmitool | Hardware | System Event Log (SEL) analysis for hardware errors |
| check-nccl | GPU | NCCL collective operation performance testing |
| check-node | System | Node uptime, kernel modules, and package repositories |
| check-nvidia-smi | GPU | Comprehensive GPU health validation via NVML |
| check-pci | Hardware | PCI device presence and PCIe link validation |
| check-process | Process | Process existence and state validation |
| check-processor | CPU | CPU/processor configuration validation |
| check-sensors | Hardware | Fan and PSU sensor monitoring via IPMI |
| check-service | System | Service status and package version validation |
| check-ssh-certs | Security | SSH certificate validation against IPA |
| check-storage | Storage | Disk usage, mounts, and file system validation |
| check-syslogs | System | System log analysis for hardware and network errors |
| check-telemetry | Utility | Telemetry publishing test utility |
| memtest | GPU | GPU memory integrity testing |

Exit Codes

All health checks follow a standard exit code convention:

| Exit Code | Description |
|-----------|-------------|
| OK (0) | Check passed successfully |
| WARN (1) | Non-critical issues detected |
| CRITICAL (2) | Critical issues detected |
| UNKNOWN (3) | Check could not complete |
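
A wrapper script that consumes this convention might look like the following minimal sketch. The `ExitCode` enum and the `run_and_classify` helper are hypothetical names introduced here for illustration and are not part of GCM. Treating timeouts and return codes outside 0-3 as UNKNOWN is an assumption of the sketch, not documented behavior.

```python
import subprocess
import sys
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the convention described above."""
    OK = 0
    WARN = 1
    CRITICAL = 2
    UNKNOWN = 3


def run_and_classify(argv, timeout=300):
    """Run a command and map its return code to an ExitCode.

    Assumption of this sketch: timeouts and return codes outside
    0-3 are reported as UNKNOWN.
    """
    try:
        completed = subprocess.run(argv, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        return ExitCode.UNKNOWN
    try:
        return ExitCode(completed.returncode)
    except ValueError:
        return ExitCode.UNKNOWN


if __name__ == "__main__":
    # Stand-in command that exits with status 2; substitute a real
    # health check invocation here.
    status = run_and_classify([sys.executable, "-c", "import sys; sys.exit(2)"])
    print(status.name)
```

Mapping every unrecognized outcome to UNKNOWN keeps the caller's handling to the four documented states.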

Command-Line Options

Common Options:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Check-specific options are documented in each health check's page.
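
To make the types and defaults of the common options concrete, here is a minimal sketch that mirrors them with Python's standard `argparse` module. It is an illustration only: GCM's own option parsing may use a different library, `build_common_parser` is a hypothetical helper, and the `key=value` form shown for `--sink-opts` is an assumed format.

```python
import argparse


def build_common_parser():
    """Build a parser mirroring the common options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Common health check options")
    parser.add_argument("--timeout", type=int, default=300,
                        help="Command execution timeout in seconds")
    parser.add_argument("--sink", type=str, default="do_nothing",
                        help="Telemetry sink destination")
    # "Multiple": the option may be repeated; values are collected in a list.
    parser.add_argument("--sink-opts", action="append", default=[],
                        help="Sink-specific configuration")
    parser.add_argument("--verbose-out", action="store_true",
                        help="Display detailed output")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    parser.add_argument("--log-folder", type=str,
                        default="/var/log/fb-monitoring", help="Log directory")
    parser.add_argument("--heterogeneous-cluster-v1", action="store_true",
                        help="Enable heterogeneous cluster support")
    return parser


if __name__ == "__main__":
    # The sink option value below is a made-up example, not a real setting.
    args = build_common_parser().parse_args(
        ["--timeout", "60", "--sink-opts", "key=value", "--verbose-out"]
    )
    print(args.timeout, args.sink, args.sink_opts, args.verbose_out)
```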

Feature Flag Disabled

If a check is disabled via a feature flag, it returns OK immediately without performing any validation. If a check reports OK unexpectedly, verify its feature flag configuration.
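
The short-circuit behavior can be pictured with the following minimal sketch. The `run_unless_disabled` helper and the way the flag value reaches it are hypothetical; how GCM actually stores and looks up feature flags is not covered on this page.

```python
from typing import Callable

OK = 0  # exit code for a passing check


def run_unless_disabled(enabled: bool, check: Callable[[], int]) -> int:
    """Return OK without calling the check when its flag is disabled."""
    if not enabled:
        # Disabled checks report OK and perform no validation.
        return OK
    return check()


def failing_check() -> int:
    """Stand-in check that always reports CRITICAL (2)."""
    return 2


if __name__ == "__main__":
    print(run_unless_disabled(False, failing_check))  # flag off: check skipped
    print(run_unless_disabled(True, failing_check))   # flag on: check runs
```

This is why a disabled check is indistinguishable from a passing one when only the exit code is inspected.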

Contributing

When adding new health checks:

  1. Implement in gcm/gcm/health_checks/checks/
  2. Create documentation following the format of existing checks
  3. Add an entry to this README in the appropriate category
  4. Include usage examples
  5. Document all command-line options and exit conditions