Health Checks Documentation
Comprehensive documentation for all GCM health checks. Health checks validate system health, hardware functionality, and configuration correctness across compute nodes.
Overview
Health checks are organized by category; the Quick Reference table below lists each check together with its category.
Quick Reference
| Check | Category | Purpose |
|---|---|---|
| check-airstore | Storage | Validates Flash Array credential configuration |
| check-authentication | Security | Verifies password status and file access permissions |
| check-blockdev | Storage | Monitors NVMe device health via SMART data |
| check-dcgmi | GPU | NVIDIA DCGM diagnostics and NVLink validation |
| check-ethlink | Network | Validates Ethernet interface configuration and state |
| check-hca | Network | Verifies InfiniBand HCA device count |
| check-ib | Network | InfiniBand link health and performance validation |
| check-ipmitool | Hardware | System Event Log (SEL) analysis for hardware errors |
| check-nccl | GPU | NCCL collective operation performance testing |
| check-node | System | Node uptime, kernel modules, and package repositories |
| check-nvidia-smi | GPU | Comprehensive GPU health validation via NVML |
| check-pci | Hardware | PCI device presence and PCIe link validation |
| check-process | Process | Process existence and state validation |
| check-processor | CPU | CPU/processor configuration validation |
| check-sensors | Hardware | Fan and PSU sensor monitoring via IPMI |
| check-service | System | Service status and package version validation |
| check-ssh-certs | Security | SSH certificate validation against IPA |
| check-storage | Storage | Disk usage, mounts, and file system validation |
| check-syslogs | System | System log analysis for hardware and network errors |
| check-telemetry | Utility | Telemetry publishing test utility |
| memtest | GPU | GPU memory integrity testing |
Exit Codes
All health checks follow a standard exit code convention:
| Exit Code | Description |
|---|---|
| OK (0) | Check passed successfully |
| WARN (1) | Non-critical issues detected |
| CRITICAL (2) | Critical issues detected |
| UNKNOWN (3) | Check could not complete |
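For scripts that wrap the checks, the convention above can be captured in a small enum. This is a minimal sketch, not part of GCM: the names `ExitCode` and `interpret` are illustrative, and treating any return code outside 0-3 as UNKNOWN is an assumption.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the table above (class name is illustrative)."""

    OK = 0
    WARN = 1
    CRITICAL = 2
    UNKNOWN = 3


def interpret(returncode: int) -> ExitCode:
    """Map a raw process return code to an ExitCode.

    Assumption: any code outside 0-3 is reported as UNKNOWN.
    """
    try:
        return ExitCode(returncode)
    except ValueError:
        return ExitCode.UNKNOWN
```

A wrapper could then branch on `interpret(proc.returncode)` instead of comparing against bare integers.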
Command-Line Options
Common Options:
| Option | Type | Default | Description |
|---|---|---|---|
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |
Check-specific options are documented in each health check's page.
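As a rough illustration of how the common options combine, the sketch below assembles an argument list for one check. Several details are assumptions rather than documented behavior: the executable name `health_checks`, the placement of options after the check name, and passing `--sink-opts` once per value. The helper `build_command` is hypothetical.

```python
from typing import List, Sequence

# Choices listed for --log-level in the table above.
VALID_LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def build_command(
    check: str,
    *,
    executable: str = "health_checks",  # assumed entry-point name
    timeout: int = 300,
    sink: str = "do_nothing",
    sink_opts: Sequence[str] = (),
    verbose_out: bool = False,
    log_level: str = "INFO",
    log_folder: str = "/var/log/fb-monitoring",
) -> List[str]:
    """Build an argv list using the documented defaults."""
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(f"unsupported log level: {log_level}")
    cmd = [
        executable, check,
        "--timeout", str(timeout),
        "--sink", sink,
        "--log-level", log_level,
        "--log-folder", log_folder,
    ]
    # Assumption: a "Multiple" option is given by repeating the flag.
    for opt in sink_opts:
        cmd += ["--sink-opts", opt]
    if verbose_out:
        cmd.append("--verbose-out")
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd)`, whose `returncode` follows the exit code convention described earlier.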
Feature Flag Disabled
If a check is disabled via feature flag, it returns OK immediately without performing any validation. If a check reports OK unexpectedly, verify its feature flag configuration.
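The sketch below shows why a disabled check cannot be told apart from a passing one by exit code alone. It is not GCM's implementation: the `run_check` helper, the `flags` mapping, and the rule that unlisted checks count as enabled are all hypothetical.

```python
from typing import Callable, Mapping

OK = 0  # exit code for a passing check, per the Exit Codes table


def run_check(
    name: str,
    flags: Mapping[str, bool],
    validate: Callable[[], int],
) -> int:
    """Run `validate` unless the check's feature flag is off.

    Hypothetical model: `flags` maps check names to enabled/disabled,
    and checks missing from the mapping are assumed enabled.
    """
    if not flags.get(name, True):
        return OK  # disabled: report OK without validating anything
    return validate()
```

Here `run_check("check-ib", {"check-ib": False}, validate)` returns 0 even when `validate` would have reported CRITICAL, which is why the flag configuration is the first thing to inspect.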
Contributing
When adding new health checks:
- Implement the check in gcm/gcm/health_checks/checks/
- Create documentation following the format of existing checks
- Add an entry to this README in the appropriate category
- Include usage examples
- Document all command-line options and exit conditions