Getting Started
GCM Monitoring is a Python CLI with a series of collectors that gather Slurm and GPU (NVML) data in a loop and publish it to a given exporter.
For a cluster-level view of GCM Monitoring: [architecture diagram]
On a Kubernetes environment: [deployment diagram]
Quick Start Guide
Requirements
- Python 3.10+
- pynvml (for GPU monitoring features; a quick sanity check is sketched after this list)
- Slurm (for job scheduler integration)
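Before enabling GPU collectors, you can check that pynvml sees the GPUs on a node with a minimal standalone snippet like the following (this is not part of gcm; the printed fields are just for illustration):
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    print(f"Visible GPUs: {count}")
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: {name}, utilization {util.gpu}%")
finally:
    pynvml.nvmlShutdown()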
Installation
pip install gpucm
Install from GitHub
# Latest main
pip install --upgrade git+ssh://git@github.com/facebookresearch/gcm.git@main
# Specific release
pip install --upgrade git+ssh://git@github.com/facebookresearch/gcm.git@<release>
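After installation, confirm the CLI is available on your PATH:
gcm --version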
CLI
After installing gcm, you should be able to call it via the CLI:
$ gcm --help
Usage: gcm [OPTIONS] COMMAND [ARGS]...
GPU Cluster Monitoring: Large-Scale AI Research Cluster Monitoring.
Options:
--config FILE Load option values from table 'gcm' in the given TOML config
file. A non-existent path or '/dev/null' are ignored and
treated as empty tables. [default: /etc/fb-gcm/config.toml]
-d, --detach Exit immediately instead of waiting for GCM to run.
--version Show the version and exit.
--help Show this message and exit.
Commands:
fsacct
nvml_monitor Script for reading gpu metrics on the node.
sacct_backfill A script to backfill sacct data into sink.
sacct_publish Take the output of sacct SACCT_OUTPUT in the...
sacct_running Collects slurm running jobs through sacct and sends...
sacctmgr_qos Collects slurm QOS information and sends to sink.
sacctmgr_user Collects slurm user information and sends to sink.
scontrol Collects slurm scontrol partition information and...
scontrol_config Collects slurm scontrol config information and sends...
slurm_job_monitor Retrieve SLURM node and job metrics.
slurm_monitor Publish SLURM metrics and logs from sacct, sdiag,...
storage Collects slurm storage partition information and...
GCM Collectors
Check out the GCM Collectors documentation for a list of supported collectors.
Each collector can be invoked via the CLI:
$ gcm <collector> --help
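For example, to see the options for the GPU metrics collector from the command list above:
$ gcm nvml_monitor --help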
At a high level, you can think of each collector above as a daemon that runs non-stop, collects data at a fixed interval, and exports it to the configured sink.
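Conceptually, each collector boils down to a loop like the following. This is only a rough sketch: collect(), publish(), and the interval value are illustrative names, not gcm's actual internals.
import time

INTERVAL_SECONDS = 60  # illustrative interval between collections

def collect():
    # stand-in for reading Slurm or NVML data
    return {"example_metric": 1}

def publish(sample):
    # stand-in for sending the sample to the configured sink
    print(sample)

while True:
    publish(collect())
    time.sleep(INTERVAL_SECONDS)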
Configuration Files
GCM Monitoring supports TOML configuration files to simplify parameter management:
# FILE: cluster_config.toml
[gcm.slurm_monitor]  # <-- this is where you specify the collector name
# All CLI options are supported in the config file
...
sink = "otel"
sink_opts = [
"otel_endpoint=https://otlp.observability-platform.com",
"otel_timeout=60",
"log_resource_attributes={'environment': 'production', 'cluster': 'gpu-cluster-a', 'region': 'us-west-2'}",
"metric_resource_attributes={'environment': 'production', 'cluster': 'gpu-cluster-a', 'region': 'us-west-2'}",
]
You can then invoke the collector with the config file:
# Using config file for monitoring
gcm --config=/etc/cluster_config.toml slurm_monitor
# CLI arguments override config file values
gcm --config=/etc/cluster_config.toml slurm_monitor --sink=stdout --once
License
GCM Monitoring is licensed under the MIT License.