📄️ Getting Started
GCM Monitoring is a Python CLI whose collectors gather Slurm and GPU (NVML) data in a loop and publish it to a configured exporter.
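The collect-and-publish loop described above can be pictured with a minimal sketch. Every name here (`ListExporter`, `collect_sample`, `run_loop`) is hypothetical and chosen for illustration, not taken from GCM's API, and the sample values are placeholders rather than real NVML or Slurm readings.

```python
import json
import time
from typing import Callable, Dict, List


class ListExporter:
    """Hypothetical exporter that keeps published records in memory."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def write(self, sample: Dict[str, object]) -> None:
        # Serialize each sample as one JSON line.
        self.records.append(json.dumps(sample, sort_keys=True))


def collect_sample() -> Dict[str, object]:
    # Placeholder collector: a real one would query Slurm or NVML here.
    return {"source": "placeholder", "gpu_util": 0}


def run_loop(
    collect: Callable[[], Dict[str, object]],
    exporter: ListExporter,
    iterations: int,
    interval_s: float = 0.0,
) -> None:
    # Collect one sample per iteration and hand it to the exporter.
    for _ in range(iterations):
        exporter.write(collect())
        time.sleep(interval_s)


exporter = ListExporter()
run_loop(collect_sample, exporter, iterations=3)
```

Keeping the collector and the exporter behind separate callables means either side can be swapped without touching the loop, which is the general shape the description suggests.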
📄️ Kubernetes Deployment
GCM Monitoring can be deployed on Kubernetes GPU clusters as a DaemonSet that runs gcm nvml_monitor on every GPU node.
🗃️ Collectors
14 items
📄️ Telemetry Types
GCM supports two types of telemetry
📄️ Adding New Collector
GCM is meant to be easily extensible. To monitor something new with GCM, you'll need to:
🗃️ Exporters
6 items
📄️ Adding New Exporter
GCM supports multiple exporters, each responsible for exporting data to a different destination. To add a new exporter, you'll need to:
📄️ Slurm REST API Client
The SlurmRestClient provides the same SlurmClient interface as the CLI-based SlurmCliClient, but queries the Slurm REST API (slurmrestd) over HTTP instead of executing subprocess commands. This is useful for environments where Slurm CLI tools are not installed on monitoring hosts.
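The idea of two interchangeable clients behind one interface can be sketched as follows. This is an illustration under stated assumptions, not GCM's code: `SlurmClientLike`, `CliClient`, `RestClient`, and `node_names` are hypothetical stand-ins for the real `SlurmClient`, `SlurmCliClient`, and `SlurmRestClient`, and the REST path and response shape are assumed and may differ across slurmrestd versions. The subprocess and HTTP calls are injected as plain callables so the example runs without Slurm installed.

```python
import json
from typing import Callable, List, Protocol


class SlurmClientLike(Protocol):
    """Hypothetical shared interface; the real SlurmClient may differ."""

    def node_names(self) -> List[str]: ...


class CliClient:
    """Gets node names by running a command (runner injected by the caller)."""

    def __init__(self, run: Callable[[List[str]], str]) -> None:
        self._run = run

    def node_names(self) -> List[str]:
        # sinfo: -h drops the header, -N lists per node, -o %N prints names.
        out = self._run(["sinfo", "-h", "-N", "-o", "%N"])
        return sorted(set(out.split()))


class RestClient:
    """Gets node names over HTTP (fetcher injected by the caller)."""

    def __init__(self, fetch: Callable[[str], str], api_version: str = "v0.0.40") -> None:
        self._fetch = fetch
        self._api_version = api_version  # assumed version string

    def node_names(self) -> List[str]:
        # Assumed path and payload shape: {"nodes": [{"name": ...}, ...]}.
        body = self._fetch(f"/slurm/{self._api_version}/nodes")
        return sorted(node["name"] for node in json.loads(body)["nodes"])


def count_nodes(client: SlurmClientLike) -> int:
    # Calling code depends only on the interface, not on the transport.
    return len(client.node_names())


# Fake transports stand in for subprocess and HTTP in this sketch.
cli = CliClient(lambda argv: "node1\nnode2\n")
rest = RestClient(lambda path: json.dumps({"nodes": [{"name": "node1"}, {"name": "node2"}]}))
```

Because `count_nodes` only relies on the shared method, either client can be passed in, which mirrors why a REST-backed client helps on hosts without the Slurm CLI tools.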
📄️ Contributing
Check out the GCM Monitoring contributing guide here.