slurm_job_monitor

Overview

Collects two real-time data streams from SLURM: node metrics via `sinfo` and job metrics via `squeue`. Provides lightweight, high-frequency snapshots of cluster infrastructure state and active workload, supporting real-time monitoring, capacity planning, resource-utilization tracking, and bottleneck detection.

Data Type: DataType.LOG

Data Identifiers: DataIdentifier.NODE (node data), DataIdentifier.JOB (job data)

Schemas: NodeData (nodes), JobData (jobs)

The collector publishes two separate data streams with distinct DataIdentifier values for independent indexing, scaling, and targeted analysis.
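
For example, a downstream consumer can dispatch records by their identifier, as in the sketch below. This is illustrative only: the `data_identifier` field name and the index names are assumptions for the example, not part of the collector's wire format.

# Hypothetical routing of the two streams; the record envelope is assumed.
def route(record: dict) -> str:
    """Return the index a published record belongs to."""
    identifier = record.get("data_identifier")  # assumed field name
    if identifier == "NODE":
        return "slurm_nodes"  # NodeData stream
    if identifier == "JOB":
        return "slurm_jobs"   # JobData stream
    raise ValueError(f"unknown identifier: {identifier!r}")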

Execution Scope

Runs on a single node in the cluster.

Output Schema

NodeData (Node Infrastructure)

Published with DataType.LOG and DataIdentifier.NODE:

{
    # Metadata
    "num_rows": int,              # Total nodes in this collection
    "collection_unixtime": int,   # Collection timestamp (Unix epoch)
    "cluster": str,               # Cluster name
    "derived_cluster": str,       # Sub-cluster (same as cluster if not `--heterogeneous-cluster-v1`)

    # Node Identification
    "NODE_NAME": str,             # Node hostname
    "PARTITION": str,             # Partition assignment

    # CPU Resources
    "CPUS_ALLOCATED": int,        # CPUs currently allocated to jobs
    "CPUS_IDLE": int,             # CPUs available (idle)
    "CPUS_OTHER": int,            # CPUs in other states
    "CPUS_TOTAL": int,            # Total CPUs on node

    # Memory Resources
    "FREE_MEM": int | None,       # Free memory (MB)
    "MEMORY": int | None,         # Total memory (MB)

    # GPU Resources
    "NUM_GPUS": int,              # Number of GPUs on node

    # Node State
    "STATE": str,                 # Node state (idle, allocated, down, etc.)
    "REASON": str,                # Reason for down/drain state
    "USER": str,                  # User if node reserved
    "RESERVATION": str,           # Reservation name if applicable

    # Node Metadata
    "TIMESTAMP": str,             # Last sinfo update
    "ACTIVE_FEATURES": str,       # Active node features/constraints
}
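
As a worked example, cluster-wide CPU utilization can be computed from a batch of NodeData records. This is a minimal sketch assuming the records arrive as plain dicts with the fields above; it is not part of the collector itself.

def cpu_utilization(nodes: list[dict]) -> float:
    """Fraction of all CPUs currently allocated across the reported nodes."""
    allocated = sum(n["CPUS_ALLOCATED"] for n in nodes)
    total = sum(n["CPUS_TOTAL"] for n in nodes)
    return allocated / total if total else 0.0

# Two-node example: (96 + 0) / (128 + 128) = 0.375
nodes = [
    {"CPUS_ALLOCATED": 96, "CPUS_TOTAL": 128},
    {"CPUS_ALLOCATED": 0, "CPUS_TOTAL": 128},
]
print(cpu_utilization(nodes))  # 0.375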

JobData (Job Queue)

Published with DataType.LOG and DataIdentifier.JOB:

{
    # Metadata
    "collection_unixtime": int,     # Collection timestamp (Unix epoch)
    "cluster": str,                 # Cluster name
    "derived_cluster": str,         # Sub-cluster (same as cluster if not `--heterogeneous-cluster-v1`)

    # Job Identification
    "JOBID": str,                   # Job array ID
    "JOBID_RAW": str,               # Raw job ID (includes array indices)
    "NAME": str,                    # Job name
    "USER": str,                    # Username
    "ACCOUNT": str,                 # Account/project
    "PARTITION": str,               # Partition name
    "QOS": str,                     # Quality of Service

    # Job State
    "STATE": str,                   # Job state (RUNNING, PENDING, etc.)
    "REASON": str,                  # Reason for pending state
    "PRIORITY": float | None,       # Job priority
    "PENDING_RESOURCES": str,       # Resources preventing job from running

    # Resource Requests
    "CPUS": int | None,             # CPUs requested
    "MIN_CPUS": int | None,         # Minimum CPUs per node
    "NODES": int | None,            # Nodes requested
    "GPUS_REQUESTED": int | None,   # GPUs requested (from TRES-PER-NODE)
    "MIN_MEMORY": int,              # Memory requested (MB)

    # TRES Allocations (for running jobs)
    "TRES_CPU_ALLOCATED": int,      # CPUs allocated
    "TRES_GPUS_ALLOCATED": int,     # GPUs allocated
    "TRES_MEM_ALLOCATED": int,      # Memory allocated (MB)
    "TRES_NODE_ALLOCATED": int,     # Nodes allocated
    "TRES_BILLING_ALLOCATED": int,  # Billing units allocated

    # Time Information
    "SUBMIT_TIME": str,             # Job submission time (ISO 8601)
    "ELIGIBLE_TIME": str,           # Time job became eligible to run
    "START_TIME": str,              # Job start time (or estimated for pending)
    "ACCRUE_TIME": str,             # Time job started accruing priority
    "TIME_USED": str,               # Elapsed time (HH:MM:SS)
    "TIME_LEFT": str,               # Remaining time (HH:MM:SS)
    "TIME_LIMIT": str,              # Time limit (HH:MM:SS)
    "PENDING_TIME": int | None,     # Seconds job has been pending

    # Node Assignment
    "NODELIST": list[str] | None,   # Assigned nodes (for running jobs)
    "SCHEDNODES": list[str] | None, # Nodes under consideration for scheduling
    "EXC_NODES": list[str] | None,  # Excluded nodes

    # Scheduling & Dependencies
    "DEPENDENCY": str,              # Job dependencies
    "RESERVATION": str,             # Reservation if applicable
    "FEATURE": str,                 # Required node features
    "REQUEUE": str,                 # Whether job can be requeued
    "RESTARTCNT": int,              # Number of restarts
    "COMMENT": str,                 # Job comment
    "COMMAND": str,                 # Job command/script
}
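
Similarly, JobData records support queue analysis such as totaling pending GPU demand per partition. A minimal sketch, under the same assumption that records arrive as plain dicts with the fields above:

from collections import defaultdict

def pending_gpu_demand(jobs: list[dict]) -> dict[str, int]:
    """Total GPUs requested by PENDING jobs, grouped by partition."""
    demand: dict[str, int] = defaultdict(int)
    for job in jobs:
        if job["STATE"] == "PENDING" and job.get("GPUS_REQUESTED"):
            demand[job["PARTITION"]] += job["GPUS_REQUESTED"]
    return dict(demand)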

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `--cluster` | String | Auto-detected | Cluster name for metadata enrichment |
| `--sink` | String | Required | Sink destination; see Exporters |
| `--sink-opts` | Multiple | - | Sink-specific options |
| `--log-level` | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `--log-folder` | String | /var/log/fb-monitoring | Log directory |
| `--stdout` | Flag | False | Display metrics on stdout in addition to logs |
| `--heterogeneous-cluster-v1` | Flag | False | Enable per-partition metrics for heterogeneous clusters |
| `--interval` | Integer | 60 | Seconds between collection cycles |
| `--once` | Flag | False | Run once and exit (no continuous monitoring) |
| `--retries` | Integer | Shared default | Retry attempts on sink failures |
| `--dry-run` | Flag | False | Print to stdout instead of publishing to the sink |
| `--chunk-size` | Integer | Shared default | Maximum size in bytes of each chunk written to the sink |

Usage Examples

Basic Continuous Monitoring

# Monitor nodes and jobs every minute (default)
gcm slurm_job_monitor --sink graph_api --sink-opts scribe_category=slurm_realtime

High-Frequency Monitoring

# Check every 30 seconds
gcm slurm_job_monitor --interval 30 --sink graph_api

One-Time Snapshot

# Get current cluster state once
gcm slurm_job_monitor --once --sink stdout

Dry Run for Testing

# Test without publishing to production sink
gcm slurm_job_monitor --once --dry-run

File Output

# Save to local files
gcm slurm_job_monitor \
    --once \
    --sink file \
    --sink-opts filepath=/tmp/slurm_data.json
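
To inspect the saved file afterwards, a short script like the one below works. It assumes the file sink writes one JSON record per line; if your sink writes a single JSON array instead, adjust accordingly.

import json

# Assumption: one JSON record per line (JSON Lines); not confirmed by this doc.
with open("/tmp/slurm_data.json") as fh:
    records = [json.loads(line) for line in fh if line.strip()]
print(f"loaded {len(records)} records")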