
Welcome to the GCM Blog

March 1, 2026 · 2 min read

Lucca Bertoncini

GCM Core Maintainer

Welcome to the GPU Cluster Monitoring (GCM) blog!

What is GCM?

GCM is an open-source suite of tools for monitoring large-scale AI research clusters. It includes:

Monitoring: Collects cluster statistics from Slurm, providing visibility into job performance and resource utilization.
Health Checks: Verifies the proper functioning of hardware, software, network, storage, and services throughout the job lifecycle.
Telemetry Processor: Enriches OpenTelemetry data by correlating it with Slurm metadata.

Related Tools

GCM builds on top of other open-source tools released by FAIR:

Clusterscope - A CLI and Python library to extract information from HPC clusters and Slurm jobs. Clusterscope automatically detects cluster configuration and provides a unified interface to query cluster resources.
GPU Node ID (GNI) - Creates unique node fingerprints by hashing all GPU IDs (hash(GPU₀ + GPU₁ + ... + GPUₙ)), giving each node a stable identity across reboots.
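
To make the fingerprint idea concrete, here is a minimal sketch of hashing a node's GPU IDs into a single identifier. It is an illustration only, not GNI's actual implementation: the function name `node_fingerprint`, the choice of SHA-256, the newline separator, and the sorting step are all assumptions made for this example.

```python
import hashlib


def node_fingerprint(gpu_ids):
    """Return a hex digest derived from every GPU ID on a node.

    Hypothetical helper for illustration; GNI's real hashing scheme may differ.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")
    # Assumption: sort the IDs so the result does not depend on the order
    # in which the GPUs happen to be enumerated.
    # Assumption: join with a newline so ["ab", "c"] and ["a", "bc"]
    # cannot collapse into the same input string.
    joined = "\n".join(sorted(gpu_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Placeholder IDs, not real hardware identifiers.
ids = ["GPU-aaaa", "GPU-bbbb", "GPU-cccc"]
fingerprint = node_fingerprint(ids)
```

Because the digest depends only on the set of GPU IDs, it stays the same across reboots as long as the same GPUs remain in the node, and it changes when a GPU is swapped out.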

Learn More

Get Involved

We welcome contributions from the community. Here's how you can get started:

Check out the docs to get started
Browse open issues for ways to contribute
Submit an RFC for design proposals
Join the conversation on Discord

Stay tuned for more updates!
