Kubernetes Deployment

GCM Health Checks can be deployed on Kubernetes GPU clusters using Node Problem Detector (NPD) as a DaemonSet. NPD runs each GCM health check as a subprocess at a configurable interval and reports results as Kubernetes node conditions.

Architecture

DaemonSet (one per GPU node)
└── Pod
    └── Container: node-problem-detector (NPD)
        ├── Invokes: health_checks check-syslogs xid ...
        ├── Invokes: health_checks check-nvidia-smi ...
        ├── Invokes: health_checks check-dcgmi ...
        └── ...

NPD is the scheduler: it runs each health check, manages concurrency and retries (configurable via healthChecks.maxRetries and healthChecks.retrySleep), and reports results as:

  • Kubernetes node conditions (e.g., GcmXidErrorsProblem) that downstream tools can act on
  • Prometheus metrics for dashboarding and alerting

GCM health_checks does the actual GPU inspection.
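
The run-and-retry behaviour described above can be sketched in a few lines of Python. This is an illustration only, not NPD's actual implementation: the function name run_check is hypothetical, and both the exit-code convention (0 means healthy, anything else means a problem) and the way max_retries and retry_sleep are applied are assumptions modelled on the healthChecks.maxRetries and healthChecks.retrySleep settings.

```python
import subprocess
import sys
import time


def run_check(args, max_retries=2, retry_sleep=0.0):
    """Run a check command, retrying when it fails.

    Returns (healthy, attempts). Assumption: exit code 0 means the check
    passed and any other exit code means a problem was found.
    """
    attempts = 0
    while True:
        attempts += 1
        result = subprocess.run(args, capture_output=True, text=True)
        if result.returncode == 0:
            return True, attempts
        if attempts > max_retries:
            # Initial attempt plus max_retries retries have all failed.
            return False, attempts
        time.sleep(retry_sleep)


# Stand-in commands for demonstration; a real deployment would invoke
# the health_checks binary instead of the Python interpreter.
PASSING = [sys.executable, "-c", "raise SystemExit(0)"]
FAILING = [sys.executable, "-c", "raise SystemExit(1)"]
```

Calling run_check(PASSING) returns after a single attempt, while run_check(FAILING, max_retries=2) makes the initial attempt plus two retries before reporting the check as unhealthy.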

Helm Chart

The recommended way to deploy on Kubernetes is via the GCM Helm chart:

helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
  -f <PATH/TO>/custom-values.yaml \
  --namespace <namespace> \
  --set monitoring.enabled=false \
  --set healthChecks.enabled=true

Or from source:

helm install gcm charts/gcm \
  -f <PATH/TO>/custom-values.yaml \
  --namespace <namespace> \
  --set monitoring.enabled=false \
  --set healthChecks.enabled=true

See the Helm chart README for full configuration options.

Default Health Checks

The Helm chart runs 6 health checks every 5 minutes by default:

| Check | Description | NPD Condition |
|---|---|---|
| XID Errors | Scans syslogs for NVIDIA XID errors | GcmXidErrorsProblem |
| ECC Errors | Checks uncorrected/corrected ECC counters | GcmSmiEccProblem |
| GPU Disconnected | Verifies expected GPU count is visible | GcmSmiDisconnectedProblem |
| Zombie Processes | Detects zombie GPU processes | GcmProcZombieProblem |
| DCGM NVLink Status | Checks NVLink health via DCGM | GcmDcgmiNvlinkStatusProblem |
| DCGM Diag Level 1 | Runs DCGM level 1 diagnostics | GcmDcgmiDiagProblem |

Adding Extra Checks

Additional checks can be enabled via healthChecks.extraChecks in your values file. Each entry adds a new NPD condition and check rule without modifying the chart:

healthChecks:
  extraChecks:
    - condition: GcmPciProblem
      message: "PCI device check passed"
      args: ["check-pci"]
    - condition: GcmSensorsProblem
      message: "Sensor readings check passed"
      args: ["check-sensors"]
    - condition: GcmRowRemapProblem
      message: "GPU row remap check passed"
      args: ["check-nvidia-smi", "-c", "row_remap"]

Run health_checks --help to see all available checks, or see the health checks reference for detailed documentation on each check and its arguments.
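
Before passing a values file to Helm, it can help to sanity-check the extraChecks entries. The sketch below is a hypothetical helper, not part of the chart. It assumes that every entry needs the three keys shown in the example above (condition, message, args) and that condition names follow the Gcm prefix convention. The chart itself may be more or less strict.

```python
REQUIRED_KEYS = ("condition", "message", "args")


def validate_extra_checks(entries):
    """Return a list of problems; an empty list means the entries look well-formed.

    Assumption: the required keys and the Gcm prefix rule are inferred from
    the documented example, not taken from the chart's schema.
    """
    problems = []
    for i, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"entry {i}: missing '{key}'")
        condition = entry.get("condition", "")
        if condition and not condition.startswith("Gcm"):
            problems.append(f"entry {i}: condition '{condition}' lacks the Gcm prefix")
        args = entry.get("args")
        if args is not None and (
            not isinstance(args, list)
            or not args
            or not all(isinstance(a, str) for a in args)
        ):
            problems.append(f"entry {i}: 'args' must be a non-empty list of strings")
    return problems


# The first and last entries from the values example, expressed as Python data.
EXAMPLE = [
    {"condition": "GcmPciProblem", "message": "PCI device check passed", "args": ["check-pci"]},
    {"condition": "GcmRowRemapProblem", "message": "GPU row remap check passed",
     "args": ["check-nvidia-smi", "-c", "row_remap"]},
]
```

validate_extra_checks(EXAMPLE) returns an empty list, while an entry that is missing a key or has an empty args list produces one message per problem.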

Querying Node Conditions

All GCM conditions are prefixed with Gcm:

# List all GCM conditions on a node
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}' \
  | grep Gcm
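
The same filter can be applied programmatically to the JSON form of a node object. The following is a minimal sketch: the function name gcm_conditions and the sample data are hypothetical, and the sample is only shaped like the output of kubectl get node <node-name> -o json.

```python
import json


def gcm_conditions(node):
    """Return (type, status, message) for each condition whose type starts with 'Gcm'."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        (c.get("type", ""), c.get("status", ""), c.get("message", ""))
        for c in conditions
        if c.get("type", "").startswith("Gcm")
    ]


# Hypothetical sample data, trimmed to the fields used here.
SAMPLE = json.loads("""
{
  "metadata": {"name": "gpu-node-1"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "message": "kubelet is posting ready status"},
      {"type": "GcmPciProblem", "status": "False", "message": "PCI device check passed"}
    ]
  }
}
""")

for cond_type, status, message in gcm_conditions(SAMPLE):
    print(f"{cond_type}\t{status}\t{message}")
```

Unlike grep Gcm, which matches the string anywhere on the line, this sketch matches only on the condition type prefix.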

Acting on Results

Health check conditions are standard Kubernetes node conditions. Downstream tools can watch them to automatically remediate unhealthy nodes:

  • Draino: watches node conditions and automatically cordons/drains unhealthy nodes
  • Cluster autoscalers: can detect unhealthy nodes and replace them
  • Custom controllers: can implement organization-specific remediation workflows
  • OTel sink: export results to your observability stack (e.g., Grafana) for alerting and reporting to infrastructure providers
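
As an illustration of the custom-controller option, the sketch below shows only the decision step: given node objects as plain dictionaries, it lists the nodes that have at least one failing GCM condition. The function names are hypothetical, and the sketch assumes that a condition status of "True" means the problem is present. Fetching nodes and cordoning them is left out.

```python
def failing_gcm_conditions(node):
    """Return the sorted Gcm condition types on a node whose status is 'True'."""
    return sorted(
        c["type"]
        for c in node.get("status", {}).get("conditions", [])
        if c.get("type", "").startswith("Gcm") and c.get("status") == "True"
    )


def plan_remediation(nodes):
    """Map node name -> failing Gcm conditions, for nodes that have any."""
    plan = {}
    for node in nodes:
        failing = failing_gcm_conditions(node)
        if failing:
            plan[node["metadata"]["name"]] = failing
    return plan


# Hypothetical node objects, trimmed to the fields used here.
NODES = [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"conditions": [
         {"type": "Ready", "status": "True"},
         {"type": "GcmXidErrorsProblem", "status": "True"},
         {"type": "GcmSmiEccProblem", "status": "False"},
     ]}},
    {"metadata": {"name": "gpu-node-2"},
     "status": {"conditions": [
         {"type": "Ready", "status": "True"},
         {"type": "GcmXidErrorsProblem", "status": "False"},
     ]}},
]
```

With the sample data, plan_remediation(NODES) selects only gpu-node-1, because it is the only node with a Gcm condition set to "True". A real controller would then cordon or drain the selected nodes according to its own policy.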

Docker Image

The health checks DaemonSet uses a combined NPD+GCM Docker image. Build it after building the base GCM image:

# Build the base GCM image first
docker build -f docker/Dockerfile -t gcm:latest .

# Build the NPD-GCM combined image
docker build -f docker/Dockerfile.npd -t gcm-npd:latest .

The NPD image bundles Node Problem Detector v0.8.19 with all GCM health check binaries.

Security Requirements

The health checks DaemonSet requires elevated privileges and a few supporting pod settings:

  • privileged: true: direct access to GPU devices, host PCI topology, and DCGM diagnostics
  • hostPID: true: visibility into host processes for zombie detection and GPU process inspection
  • hostNetwork: true: connectivity to the host's DCGM daemon for diagnostics
  • CAP_SYSLOG: access to the kernel ring buffer for XID error detection via dmesg
  • Dedicated ServiceAccount: minimal RBAC permissions (node status patching for NPD conditions)
  • priorityClassName: system-node-critical: prevents eviction under resource pressure
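
One way to confirm that a rendered DaemonSet carries these settings is to inspect the manifest as a dictionary (for example, after loading the output of helm template). The sketch below is a hypothetical helper, not part of the chart. It assumes the container is named node-problem-detector, as in the architecture diagram, and relies on the Kubernetes convention of listing capabilities without the CAP_ prefix. It does not check the ServiceAccount or RBAC rules.

```python
def missing_security_settings(daemonset, container_name="node-problem-detector"):
    """Return the documented settings that are absent from a DaemonSet manifest."""
    pod = daemonset.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    if pod.get("hostPID") is not True:
        missing.append("hostPID: true")
    if pod.get("hostNetwork") is not True:
        missing.append("hostNetwork: true")
    if pod.get("priorityClassName") != "system-node-critical":
        missing.append("priorityClassName: system-node-critical")

    # Assumption: the container name matches the architecture diagram.
    matches = [c for c in pod.get("containers", []) if c.get("name") == container_name]
    security = matches[0].get("securityContext", {}) if matches else {}
    if security.get("privileged") is not True:
        missing.append("privileged: true")
    if "SYSLOG" not in security.get("capabilities", {}).get("add", []):
        missing.append("CAP_SYSLOG")
    return missing


# Hypothetical manifest, trimmed to the fields the check reads.
MANIFEST = {
    "spec": {"template": {"spec": {
        "hostPID": True,
        "hostNetwork": True,
        "priorityClassName": "system-node-critical",
        "containers": [{
            "name": "node-problem-detector",
            "securityContext": {"privileged": True, "capabilities": {"add": ["SYSLOG"]}},
        }],
    }}}
}
```

missing_security_settings(MANIFEST) returns an empty list. Removing a setting from the manifest makes the corresponding entry appear in the result.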

Non-Kubernetes Deployment

For bare-metal or non-Kubernetes environments, health checks can be run directly via the CLI or as systemd services. See the Getting Started guide for CLI usage.