Skip to main content

Getting Started

GCM Health Checks is a Python CLI with a suite of Health Checks

Quick Start Guide

Requirements

  • Python 3.10+
  • System utilities (varies by check: nvidia-smi, dcgmi, sensors, etc.)
  • Slurm (for job scheduler integration checks)

Installation

pip install gpucm

Install from Github

# Latest main
pip install --upgrade git+ssh://git@github.com:facebookresearch/gcm.git@main

# Specific release
pip install --upgrade git+ssh://git@github.com:facebookresearch/gcm.git@<release>

CLI

After installing gcm you should be able to call health checks via the CLI:

$ health_checks --help
Usage: health_checks [OPTIONS] COMMAND [ARGS]...

GPU Cluster Monitoring: Large-Scale AI Research Cluster Monitoring.

Options:
--features-config FILE Path parameter for the features config file, to load
feature values.
--config FILE Load option values from table 'health_checks' in the
given TOML config file. A non-existent path or
'/dev/null' are ignored and treated as empty tables.
[default: /etc/fb-healthchecks/config.toml]
-d, --detach Exit immediately instead of waiting for GCM to run.
--version Show the version and exit.
--help Show this message and exit.

Commands:
check-airstore AIRStore-based application readiness checks
check-authentication authentication based checks.
check-blockdev Check block devices against the manifest file
check-dcgmi dcgmi based commands.
check-ethlink Check eth links against the manifest file
check-hca Check if HCAs are present and count matches the...
check-ib ib status checks.
check-ipmitool ipmitool based checks.
check-nccl Run NCCL tests to check both the performance and...
check-node various node based checks.
check-nvidia-smi Perform nvidia-smi checks to assess the state of...
check-pci Check pci subsystem against the manifest file.
check-process A collection of process related checks.
check-processor processor based checks.
check-sensors Invoke ipmi-sensors and return the output.
check-service check the system services.
check-ssh-certs Check hostkeys against ipa certs.
check-storage storage based checks.
check-syslogs syslog based checks.
check-telemetry Perform only the telemetry for health-checks
cuda A collection of CUDA related checks.

Available Health Checks

Check out GCM Health Checks documentation.

Configuration Files

GCM Health Checks supports TOML configuration files to simplify parameter management:

# FILE: health_check_config.toml

[health_checks.check-nvidia-smi]
timeout = 60
sink = "otel"
sink_opts = [
"otel_endpoint=https://otlp.observability-platform.com",
"otel_timeout=30",
]

You can then invoke checks with the config file:

# Using config file
health_checks --config=/etc/health_check_config.toml check-nvidia-smi gpu-present [CLUSTER] app

# CLI arguments override config file values
health_checks --config=/etc/health_check_config.toml check-nvidia-smi gpu-present --timeout=30 [CLUSTER] app

License

GCM Health Checks is licensed under the MIT License.