slurm_job_monitor

Overview

Collects two real-time data streams from SLURM: node metrics via `sinfo` and job metrics via `squeue`. Provides lightweight, high-frequency snapshots of cluster infrastructure state and active workload, supporting real-time monitoring, capacity planning, resource-utilization tracking, and bottleneck detection.

Data Type: DataType.LOG

Data Identifiers: DataIdentifier.NODE (node data), DataIdentifier.JOB (job data)

Schemas: NodeData (nodes), JobData (jobs)

The collector publishes two separate data streams with distinct DataIdentifier values for independent indexing, scaling, and targeted analysis.
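
For example, a downstream consumer can dispatch records by their identifier, as in the sketch below. This is illustrative only: the `data_identifier` field name and the index names are assumptions for the example, not part of the collector's wire format.

# Hypothetical routing of the two streams; the record envelope is assumed.
def route(record: dict) -> str:
    """Return the index a published record belongs to."""
    identifier = record.get("data_identifier")  # assumed field name
    if identifier == "NODE":
        return "slurm_nodes"  # NodeData stream
    if identifier == "JOB":
        return "slurm_jobs"   # JobData stream
    raise ValueError(f"unknown identifier: {identifier!r}")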

Execution Scope

Runs on a single node in the cluster.

Output Schema

NodeData (Node Infrastructure)

Published with DataType.LOG and DataIdentifier.NODE:

{
    # Metadata
    "num_rows": int,              # Total nodes in this collection
    "collection_unixtime": int,   # Collection timestamp (Unix epoch)
    "cluster": str,               # Cluster name
    "derived_cluster": str,       # Sub-cluster (same as cluster if not `--heterogeneous-cluster-v1`)

    # Node Identification
    "NODE_NAME": str,             # Node hostname
    "PARTITION": str,             # Partition assignment

    # CPU Resources
    "CPUS_ALLOCATED": int,        # CPUs currently allocated to jobs
    "CPUS_IDLE": int,             # CPUs available (idle)
    "CPUS_OTHER": int,            # CPUs in other states
    "CPUS_TOTAL": int,            # Total CPUs on node

    # Memory Resources
    "FREE_MEM": int | None,       # Free memory (MB)
    "MEMORY": int | None,         # Total memory (MB)

    # GPU Resources
    "NUM_GPUS": int,              # Number of GPUs on node

    # Node State
    "STATE": str,                 # Node state (idle, allocated, down, etc.)
    "REASON": str,                # Reason for down/drain state
    "USER": str,                  # User if node reserved
    "RESERVATION": str,           # Reservation name if applicable

    # Node Metadata
    "TIMESTAMP": str,             # Last sinfo update
    "ACTIVE_FEATURES": str,       # Active node features/constraints
}
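
As a worked example, cluster-wide CPU utilization can be computed from a batch of NodeData records. This is a minimal sketch assuming the records arrive as plain dicts with the fields above; it is not part of the collector itself.

def cpu_utilization(nodes: list[dict]) -> float:
    """Fraction of all CPUs currently allocated across the reported nodes."""
    allocated = sum(n["CPUS_ALLOCATED"] for n in nodes)
    total = sum(n["CPUS_TOTAL"] for n in nodes)
    return allocated / total if total else 0.0

# Two-node example: (96 + 0) / (128 + 128) = 0.375
nodes = [
    {"CPUS_ALLOCATED": 96, "CPUS_TOTAL": 128},
    {"CPUS_ALLOCATED": 0, "CPUS_TOTAL": 128},
]
print(cpu_utilization(nodes))  # 0.375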

JobData (Job Queue)

Published with DataType.LOG and DataIdentifier.JOB:

{
    # Metadata
    "collection_unixtime": int,     # Collection timestamp (Unix epoch)
    "cluster": str,                 # Cluster name
    "derived_cluster": str,         # Sub-cluster (same as cluster if not `--heterogeneous-cluster-v1`)

    # Job Identification
    "JOBID": str,                   # Job array ID
    "JOBID_RAW": str,               # Raw job ID (includes array indices)
    "NAME": str,                    # Job name
    "USER": str,                    # Username
    "ACCOUNT": str,                 # Account/project
    "PARTITION": str,               # Partition name
    "QOS": str,                     # Quality of Service

    # Job State
    "STATE": str,                   # Job state (RUNNING, PENDING, etc.)
    "REASON": str,                  # Reason for pending state
    "PRIORITY": float | None,       # Job priority
    "PENDING_RESOURCES": str,       # Resources preventing job from running

    # Resource Requests
    "CPUS": int | None,             # CPUs requested
    "MIN_CPUS": int | None,         # Minimum CPUs per node
    "NODES": int | None,            # Nodes requested
    "GPUS_REQUESTED": int | None,   # GPUs requested (from TRES-PER-NODE)
    "MIN_MEMORY": int,              # Memory requested (MB)

    # TRES Allocations (for running jobs)
    "TRES_CPU_ALLOCATED": int,      # CPUs allocated
    "TRES_GPUS_ALLOCATED": int,     # GPUs allocated
    "TRES_MEM_ALLOCATED": int,      # Memory allocated (MB)
    "TRES_NODE_ALLOCATED": int,     # Nodes allocated
    "TRES_BILLING_ALLOCATED": int,  # Billing units allocated

    # Time Information
    "SUBMIT_TIME": str,             # Job submission time (ISO 8601)
    "ELIGIBLE_TIME": str,           # Time job became eligible to run
    "START_TIME": str,              # Job start time (or estimated for pending)
    "ACCRUE_TIME": str,             # Time job started accruing priority
    "TIME_USED": str,               # Elapsed time (HH:MM:SS)
    "TIME_LEFT": str,               # Remaining time (HH:MM:SS)
    "TIME_LIMIT": str,              # Time limit (HH:MM:SS)
    "PENDING_TIME": int | None,     # Seconds job has been pending

    # Node Assignment
    "NODELIST": list[str] | None,   # Assigned nodes (for running jobs)
    "SCHEDNODES": list[str] | None, # Nodes under consideration for scheduling
    "EXC_NODES": list[str] | None,  # Excluded nodes

    # Scheduling & Dependencies
    "DEPENDENCY": str,              # Job dependencies
    "RESERVATION": str,             # Reservation if applicable
    "FEATURE": str,                 # Required node features
    "REQUEUE": str,                 # Whether job can be requeued
    "RESTARTCNT": int,              # Number of restarts
    "COMMENT": str,                 # Job comment
    "COMMAND": str,                 # Job command/script
}
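
Similarly, JobData records support queue analysis such as totaling pending GPU demand per partition. A minimal sketch, under the same assumption that records arrive as plain dicts with the fields above:

from collections import defaultdict

def pending_gpu_demand(jobs: list[dict]) -> dict[str, int]:
    """Total GPUs requested by PENDING jobs, grouped by partition."""
    demand: dict[str, int] = defaultdict(int)
    for job in jobs:
        if job["STATE"] == "PENDING" and job.get("GPUS_REQUESTED"):
            demand[job["PARTITION"]] += job["GPUS_REQUESTED"]
    return dict(demand)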

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `--cluster` | String | Auto-detected | Cluster name for metadata enrichment |
| `--sink` | String | Required | Sink destination; see Exporters |
| `--sink-opts` | Multiple | - | Sink-specific options |
| `--log-level` | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `--log-folder` | String | /var/log/fb-monitoring | Log directory |
| `--stdout` | Flag | False | Display metrics on stdout in addition to logs |
| `--heterogeneous-cluster-v1` | Flag | False | Enable per-partition metrics for heterogeneous clusters |
| `--interval` | Integer | 60 | Seconds between collection cycles |
| `--once` | Flag | False | Run once and exit (no continuous monitoring) |
| `--retries` | Integer | Shared default | Retry attempts on sink failures |
| `--dry-run` | Flag | False | Print to stdout instead of publishing to the sink |
| `--chunk-size` | Integer | Shared default | Maximum size in bytes of each chunk written to the sink |

Usage Examples

Basic Continuous Monitoring

# Monitor nodes and jobs every minute (default)
gcm slurm_job_monitor --sink graph_api --sink-opts scribe_category=slurm_realtime

High-Frequency Monitoring

# Check every 30 seconds
gcm slurm_job_monitor --interval 30 --sink graph_api

One-Time Snapshot

# Get current cluster state once
gcm slurm_job_monitor --once --sink stdout

Dry Run for Testing

# Test without publishing to production sink
gcm slurm_job_monitor --once --dry-run

File Output

# Save to local files
gcm slurm_job_monitor \
    --once \
    --sink file \
    --sink-opts filepath=/tmp/slurm_data.json
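
To inspect the saved file afterwards, a short script like the one below works. It assumes the file sink writes one JSON record per line; if your sink writes a single JSON array instead, adjust accordingly.

import json

# Assumption: one JSON record per line (JSON Lines); not confirmed by this doc.
with open("/tmp/slurm_data.json") as fh:
    records = [json.loads(line) for line in fh if line.strip()]
print(f"loaded {len(records)} records")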