Kubernetes Deployment
GCM Monitoring can be deployed on Kubernetes GPU clusters as a DaemonSet that runs gcm nvml_monitor on every GPU node.
Architecture
DaemonSet (one per GPU node)
└── Pod
    └── Container: gcm nvml_monitor
        └── Runs: gcm nvml_monitor --sink otel --cluster my-cluster --interval 60
The monitoring DaemonSet continuously collects per-device GPU metrics via NVML:
- Per-GPU: utilization, memory usage, temperature, power draw, ECC retired pages
- Per-GPU job association: Slurm job ID, user, partition, and resource allocation
- Host-level: min/max/avg GPU utilization, RAM utilization
Job association works by reading /proc/<pid>/environ of GPU compute processes to extract Slurm environment variables (SLURM_JOB_ID, SLURM_JOB_USER, etc.).
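The extraction step can be illustrated with a small shell helper. This is a sketch of the general technique, not GCM's implementation: slurm_env_from_file is a hypothetical name, and it relies only on the fact that /proc/<pid>/environ stores NAME=value entries separated by NUL bytes.

```shell
# Hypothetical helper (not part of GCM): print the SLURM_* variables found in
# a NUL-separated environ file such as /proc/<pid>/environ.
slurm_env_from_file() {
  local environ_file="$1"
  # Entries are separated by NUL bytes; turn them into newlines, then keep
  # only the Slurm variables. `|| true` avoids a non-zero status when the
  # process has no Slurm variables at all.
  tr '\0' '\n' < "$environ_file" | grep '^SLURM_' || true
}

# Example (reading another user's process usually requires root):
# slurm_env_from_file "/proc/12345/environ"
```

A process that was not launched by Slurm simply yields no output, so it would be left without a job association.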
Helm Chart
The recommended way to deploy on Kubernetes is via the GCM Helm chart:
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false
Or from source:
helm install gcm charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false
See the Helm chart README for full configuration options.
Configuration
| Parameter | Description | Default |
|---|---|---|
| monitoring.sink | Exporter sink for metrics (e.g., otel, stdout) | "" |
| monitoring.cluster | Cluster name for metrics | "" |
| monitoring.interval | Collection interval in seconds | 60 |
| monitoring.sinkOpts | Sink options (-o, OmegaConf dot-list syntax) | [] |
| monitoring.extraArgs | Additional CLI arguments for gcm nvml_monitor | [] |
| monitoring.extraEnv | Additional environment variables | [] |
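These parameters can also be kept in the values file passed with -f instead of being set one by one. The following is a minimal sketch, assuming the dotted parameter names map onto nested YAML keys in the usual Helm way; the file name custom-values.yaml, the cluster name my-cluster, and the collector endpoint are placeholders.

```shell
# Write an example values file for the GCM chart (all values are placeholders).
cat > custom-values.yaml <<'EOF'
monitoring:
  enabled: true
  sink: otel            # exporter sink, e.g. otel or stdout
  cluster: my-cluster   # cluster name attached to metrics
  interval: 60          # collection interval in seconds
  extraEnv:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: http://otel-collector:4318
healthChecks:
  enabled: false
EOF
```

With these keys in the file, the corresponding --set flags can be dropped from the helm install command.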
Sending Metrics to OpenTelemetry
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster \
--set "monitoring.extraEnv[0].name=OTEL_EXPORTER_OTLP_ENDPOINT" \
--set "monitoring.extraEnv[0].value=http://otel-collector:4318"
Sink-specific options can also be passed via monitoring.sinkOpts:
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster \
--set "monitoring.sinkOpts[0]=otel_endpoint=http://otel-collector:4318" \
--set "monitoring.sinkOpts[1]=metric_resource_attributes={'environment': 'production'}"
Run gcm nvml_monitor --help to see all available sinks and their options.
Docker Image
The monitoring DaemonSet uses the base GCM Docker image:
docker build -f docker/Dockerfile -t gcm:latest .
Security Requirements
The monitoring DaemonSet requires:
- runAsUser: 0 (root): needed to read /proc/<pid>/environ of GPU compute processes for Slurm job association
- hostPID: true: NVML reports GPU process PIDs in the host PID namespace, so the container needs host PID namespace visibility
- NVIDIA_VISIBLE_DEVICES=all: GPU access without reserving any GPU resources
- priorityClassName: system-node-critical: prevents eviction under resource pressure
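To show where each of these settings typically lives in a pod spec, here is a sketch of the relevant fragment. It uses standard Kubernetes field names, but the container name and overall layout are assumptions for illustration; the manifest rendered by the GCM chart may be organized differently.

```shell
# Print an illustrative pod-spec fragment (hypothetical layout, not the chart's
# actual template) for comparison with the deployed DaemonSet's YAML.
print_monitoring_pod_spec() {
  cat <<'EOF'
spec:
  hostPID: true                              # see host PIDs reported by NVML
  priorityClassName: system-node-critical    # avoid eviction under pressure
  containers:
    - name: nvml-monitor                     # hypothetical container name
      securityContext:
        runAsUser: 0                         # root, to read /proc/<pid>/environ
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"                       # GPU access without a GPU request
EOF
}

print_monitoring_pod_spec
```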
The monitoring DaemonSet does not require privileged mode or hostNetwork.
Non-Kubernetes Deployment
For bare-metal or non-Kubernetes environments, gcm nvml_monitor can be run directly as a systemd service. See the Getting Started guide for CLI usage.
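As a rough sketch of the systemd approach, the unit below wraps the same command line shown in the Architecture section. The unit contents, the /usr/local/bin/gcm install path, and the restart policy are assumptions, not values taken from the GCM documentation; adjust them to the actual installation.

```shell
# Print an illustrative systemd unit (hypothetical description and paths).
print_gcm_unit() {
  cat <<'EOF'
[Unit]
Description=GCM NVML monitor (example unit)
After=network-online.target
Wants=network-online.target

[Service]
# Assumed install path; check the real one with `command -v gcm`.
ExecStart=/usr/local/bin/gcm nvml_monitor --sink otel --cluster my-cluster --interval 60
Restart=on-failure
# Root is needed to read /proc/<pid>/environ of other users' processes.
User=root

[Install]
WantedBy=multi-user.target
EOF
}

print_gcm_unit
```

Saved under a name such as /etc/systemd/system/gcm-nvml-monitor.service (a hypothetical name), the unit could be started with systemctl enable --now gcm-nvml-monitor.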