Welcome to the GCM Blog
Welcome to the GPU Cluster Monitoring (GCM) blog!
What is GCM?
GCM is an open-source suite of tools for monitoring large-scale AI research clusters. It includes:
- Monitoring: Collects cluster statistics from Slurm, providing visibility into job performance and resource utilization.
- Health Checks: Verifies the proper functioning of hardware, software, network, storage, and services throughout the job lifecycle.
- Telemetry Processor: Enhances OpenTelemetry data by correlating telemetry with Slurm metadata.
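To make the correlation idea behind the Telemetry Processor concrete, here is a minimal sketch in plain Python. It is not GCM's actual implementation or API: the function name `enrich_with_slurm`, the `slurm.job_id` attribute key, and the shape of the job metadata are all assumptions made for illustration.

```python
from typing import Any, Dict, Mapping


def enrich_with_slurm(
    record: Mapping[str, Any],
    jobs_by_id: Mapping[str, Mapping[str, Any]],
) -> Dict[str, Any]:
    """Return a copy of a telemetry record with Slurm job metadata attached.

    Assumption: the record carries a "slurm.job_id" attribute, and
    jobs_by_id maps job IDs to metadata such as user or partition.
    """
    attrs = dict(record.get("attributes", {}))
    job = jobs_by_id.get(attrs.get("slurm.job_id"))
    if job is not None:
        for key, value in job.items():
            # setdefault keeps any value the telemetry record already had.
            attrs.setdefault(f"slurm.{key}", value)
    return {**record, "attributes": attrs}


# Hypothetical data, for illustration only.
jobs = {"123": {"user": "alice", "partition": "gpu"}}
record = {"name": "gpu.utilization", "attributes": {"slurm.job_id": "123"}}
enriched = enrich_with_slurm(record, jobs)
```

Records whose job ID is unknown pass through unchanged, so a lookup miss never drops telemetry.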
Related Tools
GCM builds on top of other open-source tools released by FAIR:
- Clusterscope: A CLI and Python library to extract information from HPC clusters and Slurm jobs. Clusterscope automatically detects cluster configuration and provides a unified interface to query cluster resources.
- GPU Node ID (GNI): Creates unique node fingerprints by hashing all GPU IDs (hash(GPU₀ + GPU₁ + ... + GPUₙ)), giving each node a stable identity across reboots.
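The fingerprinting idea can be sketched in a few lines of Python. This is an illustration of the hash(GPU₀ + GPU₁ + ... + GPUₙ) formula, not GNI's actual code. The function name `node_fingerprint`, the choice of SHA-256, the newline separator, and the sorting step are all assumptions of this sketch.

```python
import hashlib
from typing import Iterable


def node_fingerprint(gpu_ids: Iterable[str]) -> str:
    """Derive a node fingerprint by hashing all of its GPU IDs together."""
    # Assumption: sort the IDs so the result does not depend on the order
    # in which the GPUs happen to be enumerated.
    ordered = sorted(gpu_ids)
    if not ordered:
        raise ValueError("at least one GPU ID is required")
    # Join with a separator so ["ab", "c"] and ["a", "bc"] hash differently.
    joined = "\n".join(ordered)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical GPU identifiers, for illustration only.
example_ids = ["GPU-aaaa", "GPU-bbbb", "GPU-cccc"]
fingerprint = node_fingerprint(example_ids)
```

Because the fingerprint depends only on the set of GPU IDs, it stays the same across reboots as long as the same GPUs are installed, and changes when any GPU is swapped.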
Learn More
Get Involved
We welcome contributions from the community. Here's how you can get started:
- Check out the docs to get started
- Browse open issues for ways to contribute
- Submit an RFC for design proposals
- Join the conversation on Discord
Stay tuned for more updates!