GCM Health Checks
Comprehensive validation suite for GPU clusters. Verify system health, hardware functionality, network connectivity, and configuration correctness across compute nodes.
GCM Monitoring
Continuously collect and export Slurm job scheduler and GPU (NVML) metrics. Supports multiple exporters, including OTLP, Prometheus, and custom sinks.
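The collect-and-export pattern described here can be illustrated with a minimal sketch. This is not GCM's actual API: `Sink`, `ListSink`, `collect_loop`, and the `read_metrics` callable are hypothetical names chosen for illustration, and the in-memory sink stands in for a real OTLP or Prometheus exporter.

```python
import time
from typing import Callable, Dict, List, Protocol, Sequence

Metrics = Dict[str, float]


class Sink(Protocol):
    """Hypothetical exporter interface: anything with a write() method."""

    def write(self, metrics: Metrics) -> None: ...


class ListSink:
    """Hypothetical in-memory sink, standing in for an OTLP or Prometheus exporter."""

    def __init__(self) -> None:
        self.records: List[Metrics] = []

    def write(self, metrics: Metrics) -> None:
        # Store a copy so later mutation of the sample does not alter history.
        self.records.append(dict(metrics))


def collect_loop(
    read_metrics: Callable[[], Metrics],
    sinks: Sequence[Sink],
    interval_s: float,
    iterations: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Sample metrics `iterations` times and fan each sample out to every sink."""
    for i in range(iterations):
        sample = read_metrics()
        for sink in sinks:
            sink.write(sample)
        # Wait between samples, but not after the final one.
        if i + 1 < iterations:
            sleep(interval_s)


if __name__ == "__main__":
    sink = ListSink()
    # A fixed reading replaces a real NVML or Slurm query (assumption).
    collect_loop(lambda: {"gpu_util": 50.0}, [sink], interval_s=0.0, iterations=3)
    print(len(sink.records))
```

Keeping the sink behind a small interface is one way to let new exporters be added without touching the sampling loop.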
GCM GPU Metrics
Process and analyze GPU telemetry data from Slurm workloads. Extract insights from job performance metrics and resource utilization patterns.
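As a rough illustration of the kind of analysis this implies, the sketch below averages GPU utilization per Slurm job. It assumes telemetry arrives as `(job_id, utilization_percent)` pairs; that record shape and the function name `mean_utilization_by_job` are hypothetical and not taken from GCM.

```python
from collections import defaultdict
from statistics import mean
from typing import DefaultDict, Dict, Iterable, List, Tuple


def mean_utilization_by_job(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group utilization samples by job ID and return the mean for each job."""
    by_job: DefaultDict[str, List[float]] = defaultdict(list)
    for job_id, util in samples:
        by_job[job_id].append(util)
    return {job_id: mean(values) for job_id, values in by_job.items()}


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    telemetry = [("101", 80.0), ("101", 60.0), ("102", 10.0)]
    print(mean_utilization_by_job(telemetry))
```

A per-job mean is only one possible summary; percentiles or time-weighted averages may suit bursty workloads better.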