Getting Started

Introduction

GCM is a set of tools for at-scale monitoring of HPC (High-Performance Computing) clusters. It powers Meta FAIR (Fundamental AI Research) AI workloads across hundreds of thousands of GPUs at Meta.

GCM is a monorepo with the following components:

  • GCM Monitoring: Continuous data collection, mostly for the Slurm workload scheduler, providing visibility into job performance and resource utilization.
  • GCM Health Checks: Verifies the proper functioning of hardware, software, network, storage, and services throughout the job lifecycle.
  • GCM GPU Metrics: Enhances OpenTelemetry data by correlating telemetry with Slurm metadata, enabling attribution of metrics (e.g., GPU utilization) to specific jobs and users.
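To make the attribution idea in the last bullet concrete, here is a minimal sketch of joining GPU utilization samples with job allocation metadata. It is not GCM's implementation: the `GpuSample` and `JobAllocation` types, their fields, and the `attribute_samples` function are hypothetical names chosen for illustration, and the sketch assumes that each GPU on a host belongs to at most one job at a time.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GpuSample:
    """One utilization reading for a GPU (hypothetical shape, not GCM's schema)."""
    hostname: str
    gpu_index: int
    utilization_pct: float


@dataclass(frozen=True)
class JobAllocation:
    """GPUs held by one scheduler job on one host (hypothetical shape)."""
    job_id: str
    user: str
    hostname: str
    gpu_indices: Tuple[int, ...]


def attribute_samples(
    samples: List[GpuSample],
    allocations: List[JobAllocation],
) -> Dict[Tuple[str, str], float]:
    """Return the mean GPU utilization per (job_id, user)."""
    # Map each (host, GPU) pair to the job that holds it.
    owner: Dict[Tuple[str, int], JobAllocation] = {}
    for alloc in allocations:
        for idx in alloc.gpu_indices:
            owner[(alloc.hostname, idx)] = alloc

    # Group readings by job and user; samples from unallocated GPUs are skipped.
    per_job: Dict[Tuple[str, str], List[float]] = {}
    for sample in samples:
        alloc = owner.get((sample.hostname, sample.gpu_index))
        if alloc is None:
            continue
        per_job.setdefault((alloc.job_id, alloc.user), []).append(
            sample.utilization_pct
        )

    return {key: sum(vals) / len(vals) for key, vals in per_job.items()}
```

In this sketch the join key is the (hostname, GPU index) pair. A real pipeline would also need to account for time, since allocations change as jobs start and finish.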

Each component has its own Getting Started and Contributing guides:

Getting Started

Contributing

Others

Community

You can ask questions, open issues, and submit feature requests on GitHub.

Citing GCM

If you use GCM in your research, please cite it with the following BibTeX entry:

@software{Bertoncini_Meta_GPU_Cluster,
title = {Meta GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring},
author = {Bertoncini, Lucca and Ho, Caleb and Kokolis, Apostolos and Hu, Liao and Nguyen, Thanh and Campoli, Billy and Doku, Jörg and Peng, Vivian and Wang, Max and Verma, Sujit and Li, Teng and Saxena, Neha and Johnson, Jakob and Malani, Parth and Saladi, Kalyan and Sengupta, Shubho},
howpublished = {Github},
year = {2025},
url = {https://github.com/facebookresearch/gcm/}
}