
nvml_monitor

Overview

Collects metrics from NVIDIA GPUs through the NVML (NVIDIA Management Library) API and publishes aggregated metrics at regular intervals. Provides real-time monitoring of GPU utilization, memory usage, power consumption, temperature, and SLURM job information, along with host-level metrics such as RAM utilization.

Data Type: DataType.LOG, Schemas: DevicePlusJobMetrics

Data Type: DataType.METRIC, Schemas: HostMetrics, IndexedDeviceMetrics
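
The raw per-device readings behind these schemas come from NVML. As a rough illustration only (not the tool's actual implementation), the same values could be read with the nvidia-ml-py (pynvml) bindings along these lines:

import socket
import pynvml

def sample_devices():
    # Illustrative sketch: reads the raw values that fields such as gpu_util,
    # mem_used_percent, temperature, power_draw, and power_used_percent are based on.
    pynvml.nvmlInit()
    try:
        hostname = socket.gethostname()
        samples = []
        for gpu_id in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # .gpu / .memory in %
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # bytes
            power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)      # milliwatts
            limit_mw = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle)
            samples.append({
                "gpu_id": gpu_id,
                "hostname": hostname,
                "gpu_util": util.gpu,
                "mem_util": util.memory,
                "mem_used_percent": int(100 * mem.used / mem.total),
                "temperature": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU),
                "power_draw": power_mw / 1000.0,                   # watts
                "power_used_percent": int(100 * power_mw / limit_mw),
            })
        return samples
    finally:
        pynvml.nvmlShutdown()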

Execution Scope

All GPU nodes in the cluster.

Metrics Collected

DevicePlusJobMetrics

Published with DataType.LOG, 1 sample per host:

{
    # Device Identification
    "gpu_id": int,                    # GPU index
    "hostname": str,                  # Node hostname

    # GPU Metrics
    "gpu_util": int,                  # GPU utilization (%)

    # Temperature & Power
    "temperature": int,               # GPU temperature (°C)
    "power_draw": float,              # Power usage (W)
    "power_used_percent": int,        # Power usage (%)

    # Error Counts
    "retired_pages_count_single_bit": int,    # Single-bit ECC errors
    "retired_pages_count_double_bit": int,    # Double-bit ECC errors

    # GPU Memory
    "mem_util": int,                  # Memory utilization (%)
    "mem_used_percent": int,          # Memory used (%)

    # SLURM Job Info (if job running on GPU)
    "job_id": str | None,             # SLURM job ID
    "job_user": str | None,           # Job owner
    "job_gpus": int | None,           # GPUs allocated to job
    "job_num_gpus": int | None,       # Number of GPUs used by job
    "job_num_cpus": int | None,       # Number of CPUs used by job
    "job_name": str | None,           # Job name
    "job_num_nodes": int | None,      # Number of nodes allocated
    "job_partition": str | None,      # Job partition
    "job_cpus_per_gpu": int | None,   # CPUs per GPU: job_num_cpus / job_num_gpus
}
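
For illustration only, a single DevicePlusJobMetrics record for a GPU running a SLURM job might look like the following; every value here is hypothetical, including the hostname, job ID, and user:

{
    "gpu_id": 0,
    "hostname": "gpu-node-012",      # hypothetical host
    "gpu_util": 87,
    "temperature": 64,
    "power_draw": 312.5,
    "power_used_percent": 78,
    "retired_pages_count_single_bit": 0,
    "retired_pages_count_double_bit": 0,
    "mem_util": 55,
    "mem_used_percent": 91,
    "job_id": "4821337",             # hypothetical SLURM job
    "job_user": "alice",
    "job_gpus": 4,
    "job_num_gpus": 4,
    "job_num_cpus": 32,
    "job_name": "train_model",
    "job_num_nodes": 1,
    "job_partition": "gpu",
    "job_cpus_per_gpu": 8,           # job_num_cpus / job_num_gpus = 32 / 4
}

On a GPU with no running job, the job_* fields are None.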

HostMetrics (Aggregated)

Published with DataType.METRIC:

{
    # Host-Level GPU Aggregates
    "max_gpu_util": float,    # Highest GPU utilization across all GPUs (%)
    "min_gpu_util": float,    # Lowest GPU utilization across all GPUs (%)
    "avg_gpu_util": float,    # Average GPU utilization (%)

    # Host RAM
    "ram_util": float,        # Host RAM utilization (0.0-1.0)
}
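
A minimal sketch of how these aggregates could be derived from the per-GPU samples shown earlier. The use of psutil for the host RAM reading is an assumption for illustration, not a statement about nvml_monitor's implementation:

import psutil  # assumption: illustrative source for the host RAM figure

def host_metrics(gpu_samples):
    # gpu_samples: list of per-GPU dicts such as those returned by sample_devices()
    utils = [s["gpu_util"] for s in gpu_samples]
    return {
        "max_gpu_util": float(max(utils)),
        "min_gpu_util": float(min(utils)),
        "avg_gpu_util": sum(utils) / len(utils),
        # Note: ram_util is a 0.0-1.0 fraction, unlike the GPU fields, which are percentages
        "ram_util": psutil.virtual_memory().percent / 100.0,
    }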

IndexedDeviceMetrics (Per-GPU, Aggregated)

Published with DataType.METRIC:

{
    # GPU Metrics
    "gpu_util": int,              # GPU utilization (%)

    # Temperature & Power
    "temperature": int,           # GPU temperature (°C)
    "power_draw": float,          # Power usage (W)
    "power_used_percent": int,    # Power usage (%)

    # Error Counts
    "retired_pages_count_single_bit": int,    # Single-bit ECC errors
    "retired_pages_count_double_bit": int,    # Double-bit ECC errors

    # GPU Memory
    "mem_util": int,              # Memory utilization (%)
    "mem_used_percent": int,      # Memory used (%)
}
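
Because samples are taken every --collect-interval but published every --push-interval, each IndexedDeviceMetrics record summarizes several raw samples per GPU. The exact aggregation function is not documented here; the sketch below uses a plain per-field mean as one plausible choice:

from collections import defaultdict
from statistics import mean

FIELDS = (
    "gpu_util", "temperature", "power_draw", "power_used_percent",
    "retired_pages_count_single_bit", "retired_pages_count_double_bit",
    "mem_util", "mem_used_percent",
)

def aggregate_per_device(raw_samples):
    # raw_samples: all per-GPU sample dicts buffered during one push interval
    by_gpu = defaultdict(list)
    for sample in raw_samples:
        by_gpu[sample["gpu_id"]].append(sample)
    return {
        gpu_id: {field: mean(s[field] for s in samples) for field in FIELDS}
        for gpu_id, samples in by_gpu.items()
    }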

Command-Line Options


Option                      Type      Default                  Description
--collect-interval          Integer   10 seconds               How often to sample telemetry data
--push-interval             Integer   60 seconds               How often to publish aggregated metrics
--interval                  Integer   90 seconds               How often to restart the collection cycle
--cluster                   String    Auto-detected            Cluster name for metadata enrichment
--sink                      String    Required                 Sink destination (see Exporters)
--sink-opts                 Multiple  -                        Sink-specific options
--log-level                 Choice    INFO                     DEBUG, INFO, WARNING, ERROR, CRITICAL
--log-folder                String    /var/log/fb-monitoring   Log directory
--stdout                    Flag      False                    Display metrics to stdout in addition to logs
--heterogeneous-cluster-v1  Flag      False                    Enable per-partition metrics for heterogeneous clusters
--interval                  Integer   300                      Seconds between collection cycles (5 minutes)
--once                      Flag      False                    Run once and exit (no continuous monitoring)
--retries                   Integer   Shared default           Retry attempts on sink failures
--dry-run                   Flag      False                    Print to stdout instead of publishing to the sink
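
The three interval options imply a nested timing loop: sample every --collect-interval, publish an aggregate every --push-interval, and restart the collection cycle every --interval. The sketch below shows that relationship only; sample_devices and aggregate_per_device are the illustrative helpers from earlier sections, and publish stands in for whichever sink was configured:

import time

def run(collect_interval=10, push_interval=60, cycle_interval=90, once=False):
    while True:
        cycle_start = last_push = time.time()
        buffer = []
        while time.time() - cycle_start < cycle_interval:     # --interval
            buffer.extend(sample_devices())                    # --collect-interval
            if time.time() - last_push >= push_interval:       # --push-interval
                publish(aggregate_per_device(buffer))          # hypothetical sink call
                buffer.clear()
                last_push = time.time()
            time.sleep(collect_interval)
        if once:                                               # --once: stop after one cycle
            break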

Usage Examples

Basic Continuous Monitoring

gcm nvml_monitor --sink file --sink-opts filepath=/tmp/gpu_metrics.json

One-Time Collection

gcm nvml_monitor --once --sink stdout

Custom Intervals

# Sample every 5s, publish every 30s, restart every 60s
gcm nvml_monitor \
  --collect-interval 5 \
  --push-interval 30 \
  --interval 60 \
  --sink file --sink-opts filepath=/tmp/gpu_metrics.json

Debug Mode with Console Output

gcm nvml_monitor \
  --log-level DEBUG \
  --stdout \
  --sink stdout