Welcome to the GCM Blog

· 2 min read
Lucca Bertoncini
GCM Core Maintainer

Welcome to the GPU Cluster Monitoring (GCM) blog!

What is GCM?

GCM is an open-source suite of tools for monitoring large-scale AI research clusters. It includes:

  • Monitoring: Collects cluster statistics from Slurm, providing visibility into job performance and resource utilization.
  • Health Checks: Verifies the proper functioning of hardware, software, network, storage, and services throughout the job lifecycle.
  • Telemetry Processor: Enhances OpenTelemetry data by correlating telemetry with Slurm metadata.

GCM builds on top of other open-source tools released by FAIR:

  • Clusterscope - A CLI and Python library to extract information from HPC clusters and Slurm jobs. Clusterscope automatically detects cluster configuration and provides a unified interface to query cluster resources.

  • GPU Node ID (GNI) - Creates unique node fingerprints by hashing all GPU IDs (hash(GPU₀ + GPU₁ + ... + GPUₙ)), giving each node a stable identity across reboots.
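
The GNI fingerprint idea can be illustrated with a minimal sketch. This is not GNI's actual implementation: the choice of SHA-256, the function name node_fingerprint, and the sample GPU IDs are all assumptions made for illustration. It only shows the general shape of hash(GPU₀ + GPU₁ + ... + GPUₙ): concatenate the GPU IDs in index order and hash the result.

```python
import hashlib


def node_fingerprint(gpu_ids):
    """Hash the GPU IDs, concatenated in index order, into one hex digest.

    Hypothetical helper: SHA-256 is an assumption, not necessarily
    the hash that GNI uses.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")
    digest = hashlib.sha256()
    for gpu_id in gpu_ids:
        # Feeding the IDs one by one is equivalent to hashing their concatenation.
        digest.update(gpu_id.encode("utf-8"))
    return digest.hexdigest()


# Made-up GPU IDs, for illustration only.
gpus = ["GPU-aaaa", "GPU-bbbb"]
print(node_fingerprint(gpus))
```

Because the digest depends only on the GPU IDs, it comes out the same after a reboot as long as the same GPUs are present in the same order. In this sketch, swapping or replacing a GPU changes the fingerprint.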

Learn More

Get Involved

We welcome contributions from the community. Here's how you can get started:

  • Check out the docs to get started
  • Browse open issues for ways to contribute
  • Submit an RFC for design proposals
  • Join the conversation on Discord

Stay tuned for more updates!