
GCM: Large-Scale AI Research Cluster Monitoring.


GCM Health Checks

Comprehensive validation suite for GPU clusters. Verify system health, hardware functionality, network connectivity, and configuration correctness across compute nodes.

GCM Monitoring

Collect Slurm job scheduler and GPU (NVML) metrics in a continuous loop and export them through multiple exporters, including OTLP, Prometheus, and custom sinks.
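As a rough illustration of the collect-and-export pattern described above, the sketch below runs a collection loop and fans each sample out to pluggable sinks. It is not GCM's actual API: `Sink`, `ListSink`, `run_collection_loop`, `fake_collect`, and the metric names are all hypothetical, and the collector returns placeholder values instead of querying NVML or Slurm.

```python
import time
from typing import Callable, Dict, Iterable, List, Protocol

Metrics = Dict[str, float]


class Sink(Protocol):
    """Hypothetical sink interface: anything with a write() method."""

    def write(self, metrics: Metrics) -> None: ...


class ListSink:
    """Hypothetical in-memory sink, standing in for an OTLP or Prometheus exporter."""

    def __init__(self) -> None:
        self.records: List[Metrics] = []

    def write(self, metrics: Metrics) -> None:
        # Store a copy so later mutation by the caller does not alter history.
        self.records.append(dict(metrics))


def run_collection_loop(
    collect: Callable[[], Metrics],
    sinks: Iterable[Sink],
    iterations: int,
    interval_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Collect one sample per iteration and send it to every sink."""
    sink_list = list(sinks)
    for i in range(iterations):
        metrics = collect()
        for sink in sink_list:
            sink.write(metrics)
        # Wait between samples, but not after the final one.
        if i < iterations - 1:
            sleep(interval_s)


def fake_collect() -> Metrics:
    # Placeholder values (assumption); a real collector would query NVML and Slurm.
    return {"gpu_utilization_percent": 42.0, "slurm_running_jobs": 3.0}


if __name__ == "__main__":
    sink = ListSink()
    run_collection_loop(fake_collect, [sink], iterations=3)
    print(len(sink.records))
```

Keeping the sink behind a small interface is one way a custom exporter could be added without touching the loop itself; a bounded `iterations` count is used here only so the sketch terminates.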

GCM GPU Metrics

Process and analyze GPU telemetry data from Slurm workloads. Extract insights from job performance metrics and resource utilization patterns.