kubernetes_monitor

Overview

Collects Kubernetes pod and node condition metrics for SUNK (Slurm-on-Kubernetes) cluster monitoring. Provides visibility into pod lifecycle states, container restart counts, node health conditions, and Slurm-K8s job correlation via annotations.

Data Type: DataType.METRIC, Schemas: KubernetesPodPayload, KubernetesNodePayload

Execution Scope

Runs on a single node with access to the Kubernetes API (typically a management node, or a pod with a ServiceAccount).

Prerequisites

Install the optional Kubernetes dependency:

pip install 'gpucm[kubernetes]'

Output Schema

KubernetesPodPayload

Published with DataType.METRIC and DataIdentifier.K8S_POD:

{
  "ds": str,                    # Collection date (YYYY-MM-DD in Pacific time)
  "collection_unixtime": int,   # Unix timestamp of collection
  "cluster": str,               # Cluster identifier
  "derived_cluster": str,       # Sub-cluster (optional)
  "pod": {
    "name": str,                # Pod name
    "namespace": str,           # Kubernetes namespace
    "node_name": str,           # Node the pod is scheduled on
    "phase": str,               # Pod phase (Pending/Running/Succeeded/Failed/Unknown)
    "restart_count": int,       # Container restart count
    "container_name": str,      # Container name (one row per container)
    "slurm_job_id": str,        # Slurm job ID from annotation slurm.coreweave.com/job-id
  }
}

KubernetesNodePayload

Published with DataType.METRIC and DataIdentifier.K8S_NODE:

{
  "ds": str,                    # Collection date (YYYY-MM-DD in Pacific time)
  "collection_unixtime": int,   # Unix timestamp of collection
  "cluster": str,               # Cluster identifier
  "derived_cluster": str,       # Sub-cluster (optional)
  "node_condition": {
    "name": str,                # Node name
    "condition_type": str,      # Condition type (Ready, MemoryPressure, DiskPressure, etc.)
    "status": str,              # Condition status (True/False/Unknown)
    "reason": str,              # Machine-readable reason
    "message": str,             # Human-readable message
  }
}
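The fan-out from a node's status to per-condition records can be sketched as follows. This is a minimal illustration using plain dicts in place of the Kubernetes API objects; `flatten_node_conditions` is a hypothetical helper, not part of gpucm.

```python
# Sketch: one record per node condition, shaped like KubernetesNodePayload's
# node_condition field. Uses mocked node data; flatten_node_conditions is
# hypothetical and not part of the gpucm package.
def flatten_node_conditions(node):
    """Yield one record per condition in a node's status."""
    for cond in node["status"]["conditions"]:
        yield {
            "name": node["metadata"]["name"],
            "condition_type": cond["type"],   # e.g. Ready, MemoryPressure
            "status": cond["status"],         # "True" / "False" / "Unknown"
            "reason": cond.get("reason", ""),
            "message": cond.get("message", ""),
        }

node = {
    "metadata": {"name": "gpu-node-01"},
    "status": {"conditions": [
        {"type": "Ready", "status": "True", "reason": "KubeletReady",
         "message": "kubelet is posting ready status"},
        {"type": "MemoryPressure", "status": "False",
         "reason": "KubeletHasSufficientMemory",
         "message": "kubelet has sufficient memory available"},
    ]},
}
records = list(flatten_node_conditions(node))
```

A healthy node typically reports `Ready=True` and `False` for the pressure conditions, so each collection cycle emits several condition records per node.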

Important Notes:

  1. Each container in a pod creates a separate record (one row per container)
  2. Pods without containers (e.g., Pending) still produce one record with restart_count=0
  3. The slurm_job_id field correlates Kubernetes pods with Slurm jobs in SUNK environments
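The three notes above can be illustrated with a small sketch. It uses mocked pod data rather than live Kubernetes API objects, and `flatten_pod` is a hypothetical helper (not part of gpucm); the field names follow KubernetesPodPayload.

```python
# Sketch of the per-container fan-out: one record per container status,
# and exactly one record (restart_count=0) for pods with no container
# statuses yet, e.g. Pending pods. flatten_pod is hypothetical.
SLURM_JOB_ANNOTATION = "slurm.coreweave.com/job-id"

def flatten_pod(pod):
    """Yield one KubernetesPodPayload-shaped record per container."""
    meta, status = pod["metadata"], pod["status"]
    base = {
        "name": meta["name"],
        "namespace": meta["namespace"],
        "node_name": pod["spec"].get("nodeName", ""),
        "phase": status["phase"],
        "slurm_job_id": meta.get("annotations", {}).get(SLURM_JOB_ANNOTATION, ""),
    }
    containers = status.get("containerStatuses") or []
    if not containers:  # e.g. a Pending pod: still one record
        yield {**base, "container_name": "", "restart_count": 0}
    for c in containers:
        yield {**base, "container_name": c["name"], "restart_count": c["restartCount"]}

running = {
    "metadata": {"name": "slurmd-abc", "namespace": "slurm",
                 "annotations": {SLURM_JOB_ANNOTATION: "12345"}},
    "spec": {"nodeName": "gpu-node-01"},
    "status": {"phase": "Running",
               "containerStatuses": [{"name": "slurmd", "restartCount": 2},
                                     {"name": "sidecar", "restartCount": 0}]},
}
pending = {
    "metadata": {"name": "queued-pod", "namespace": "slurm"},
    "spec": {},
    "status": {"phase": "Pending"},
}
running_rows = list(flatten_pod(running))   # two records, one per container
pending_rows = list(flatten_pod(pending))   # one record, restart_count=0
```

Note how the Slurm job ID is carried into every container record of the pod, which is what makes per-container restart counts joinable against Slurm accounting data.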

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --cluster | String | Auto-detected | Cluster name for metadata enrichment |
| --sink | String | Required | Sink destination, see Exporters |
| --sink-opts | Multiple | - | Sink-specific options |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --stdout | Flag | False | Display metrics to stdout in addition to logs |
| --interval | Integer | 60 | Seconds between collection cycles (1 minute) |
| --once | Flag | False | Run once and exit (no continuous monitoring) |
| --retries | Integer | Shared default | Retry attempts on sink failures |
| --dry-run | Flag | False | Print to stdout instead of publishing to sink |
| --chunk-size | Integer | Shared default | Maximum size in bytes of each chunk when writing data to sink |
| --namespace | String | "" (all) | Kubernetes namespace to filter pods |
| --in-cluster/--no-in-cluster | Flag | True | Use in-cluster ServiceAccount or local kubeconfig |
| --label-selector | String | "" | Kubernetes label selector to filter pods (e.g., app=slurm) |
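For intuition on `--label-selector`: an equality-based selector such as `app=slurm-worker` matches pods whose labels contain exactly that key/value pair, and the empty string matches everything. The filtering itself happens server-side in the Kubernetes API; the toy matcher below only mimics the equality form and deliberately ignores set-based selectors (`in`, `notin`, exists).

```python
# Toy matcher for equality-based label selectors ("k=v,k2=v2").
# Illustrative only; real selector syntax is richer and is evaluated
# by the Kubernetes API server, not by this module.
def matches_selector(labels, selector):
    if not selector:                      # "" selects everything
        return True
    for clause in selector.split(","):
        key, _, value = clause.partition("=")
        if labels.get(key.strip()) != value.strip():
            return False
    return True

pods = [
    {"name": "worker-0", "labels": {"app": "slurm-worker"}},
    {"name": "ctl-0", "labels": {"app": "slurm-controller"}},
]
selected = [p["name"] for p in pods
            if matches_selector(p["labels"], "app=slurm-worker")]
```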

Usage Examples

Basic Continuous Collection (In-Cluster)

gcm kubernetes_monitor --sink otel

One-Time Snapshot

gcm kubernetes_monitor --once --sink stdout --dry-run

Filter by Namespace with Label Selector

gcm kubernetes_monitor \
--sink otel \
--namespace slurm-jobs \
--label-selector "app=slurm-worker"

Using Kubeconfig (Out-of-Cluster)

gcm kubernetes_monitor \
--no-in-cluster \
--once \
--sink stdout \
--cluster my-sunk-cluster

Debug Mode

gcm kubernetes_monitor \
--once \
--log-level DEBUG \
--stdout \
--dry-run \
--no-in-cluster