Kubernetes Deployment

GCM Monitoring can be deployed on Kubernetes GPU clusters as a DaemonSet that runs gcm nvml_monitor on every GPU node.

Architecture

DaemonSet (one per GPU node)
└── Pod
    └── Container: gcm nvml_monitor
        └── Runs: gcm nvml_monitor --sink otel --cluster my-cluster --interval 60

The monitoring DaemonSet continuously collects per-device GPU metrics via NVML:

  • Per-GPU: utilization, memory usage, temperature, power draw, ECC retired pages
  • Per-GPU job association: Slurm job ID, user, partition, and resource allocation
  • Host-level: min/max/avg GPU utilization, RAM utilization

Job association works by reading /proc/<pid>/environ of GPU compute processes to extract Slurm environment variables (SLURM_JOB_ID, SLURM_JOB_USER, etc.).
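The lookup described above can be sketched in a few lines of shell. This is only an illustration of the idea, not GCM's actual implementation: `slurm_vars_from_environ` is a hypothetical helper, and it relies on the fact that the kernel stores entries in /proc/<pid>/environ separated by NUL bytes.

```shell
# Hypothetical helper for illustration; not part of GCM.
# Prints the SLURM_* variables found in a NUL-separated environ file,
# one KEY=VALUE pair per line. Prints nothing if none are present.
slurm_vars_from_environ() {
    # Convert NUL separators to newlines, then keep only SLURM_* entries.
    tr '\0' '\n' < "$1" | grep '^SLURM_' || true
}

# Example (reading another process's environ usually requires root):
# slurm_vars_from_environ /proc/<pid>/environ
```

A process that was not launched by Slurm simply yields no output, which is why a lookup of this kind can only associate GPU processes started inside a Slurm job.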

Helm Chart

The recommended way to deploy on Kubernetes is via the GCM Helm chart:

helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false

Or from source:

helm install gcm charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false

See the Helm chart README for full configuration options.

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| monitoring.sink | Exporter sink for metrics (e.g., otel, stdout) | "" |
| monitoring.cluster | Cluster name for metrics | "" |
| monitoring.interval | Collection interval in seconds | 60 |
| monitoring.sinkOpts | Sink options (-o, OmegaConf dot-list syntax) | [] |
| monitoring.extraArgs | Additional CLI arguments for gcm nvml_monitor | [] |
| monitoring.extraEnv | Additional environment variables | [] |
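As a rough sketch, the parameters listed above could be collected in a values file instead of repeated `--set` flags. The key names come from this page; the values (`my-cluster`, the collector URL) are placeholders, and the nesting under `monitoring` and `healthChecks` is assumed from the dotted parameter names rather than taken from the chart itself.

```shell
# Write an example values file. The values are placeholders to adapt;
# the YAML nesting is inferred from the dotted parameter names.
cat > custom-values.yaml <<'EOF'
monitoring:
  enabled: true
  sink: otel
  cluster: my-cluster
  interval: 60
  sinkOpts:
    - otel_endpoint=http://otel-collector:4318
  extraArgs: []
  extraEnv: []
healthChecks:
  enabled: false
EOF
```

The resulting file can then be passed to `helm install` with `-f custom-values.yaml`.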

Sending Metrics to OpenTelemetry

helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster \
--set "monitoring.extraEnv[0].name=OTEL_EXPORTER_OTLP_ENDPOINT" \
--set "monitoring.extraEnv[0].value=http://otel-collector:4318"

Sink-specific options can also be passed via monitoring.sinkOpts:

helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster \
--set "monitoring.sinkOpts[0]=otel_endpoint=http://otel-collector:4318" \
--set "monitoring.sinkOpts[1]=metric_resource_attributes={'environment': 'production'}"

Run gcm nvml_monitor --help to see all available sinks and their options.

Docker Image

The monitoring DaemonSet uses the base GCM Docker image, which can be built from the repository root:

docker build -f docker/Dockerfile -t gcm:latest .

Security Requirements

The monitoring DaemonSet requires:

  • runAsUser: 0 (root): needed to read /proc/<pid>/environ of GPU compute processes for Slurm job association
  • hostPID: true: NVML reports GPU process PIDs in the host PID namespace, so the container needs host PID namespace visibility
  • NVIDIA_VISIBLE_DEVICES=all: GPU access without reserving any GPU resources
  • priorityClassName: system-node-critical: prevents eviction under resource pressure

The monitoring DaemonSet does not require privileged mode or hostNetwork.
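One way to confirm that a rendered manifest carries the settings listed above is a plain-text check like the sketch below. `check_monitoring_manifest` is a hypothetical helper, not part of GCM or Helm. It uses fixed-string `grep` rather than parsing YAML, so it assumes the values are rendered unquoted and on one line.

```shell
# Hypothetical helper for illustration; a text search, not YAML validation.
# Reports which documented settings appear in a rendered manifest file and
# returns non-zero if any of them is missing.
check_monitoring_manifest() {
    manifest=$1
    status=0
    for pattern in \
        'runAsUser: 0' \
        'hostPID: true' \
        'NVIDIA_VISIBLE_DEVICES' \
        'priorityClassName: system-node-critical'
    do
        if grep -Fq -- "$pattern" "$manifest"; then
            echo "found:   $pattern"
        else
            echo "missing: $pattern"
            status=1
        fi
    done
    return "$status"
}

# Usage (assumes helm is installed and the chart source is checked out):
# helm template gcm charts/gcm --set monitoring.enabled=true > rendered.yaml
# check_monitoring_manifest rendered.yaml
```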

Non-Kubernetes Deployment

For bare-metal or non-Kubernetes environments, gcm nvml_monitor can be run directly as a systemd service. See the Getting Started guide for CLI usage.