scontrol

Overview

Collects SLURM partition configuration data using scontrol show partition and publishes individual partition records at regular intervals. Provides monitoring of partition-level configuration including resource limits (CPUs, memory, nodes, GPUs), TRES (Trackable RESources) allocations and billing weights, partition priority settings, QoS assignments, preempt modes, and node membership per partition.

Returns multiple records (one per partition), whereas scontrol_config returns a single aggregated record for cluster-wide configuration.

Data Type: DataType.LOG, Schema: Scontrol

Execution Scope

Single node in the cluster.

Output Schema

Scontrol (Per Partition)

Published with DataType.LOG:

{
    # Cluster Identification
    "cluster": str,                      # Cluster identifier
    "derived_cluster": str,              # Derived cluster for heterogeneous clusters

    # Partition Identity
    "Partition": str,                    # Partition name (from PartitionName)
    "Nodes": str,                        # Node list expression (e.g., "node[001-010]")

    # Resource Limits
    "MaxNodes": int,                     # Maximum nodes per job
    "TotalCPUs": int | None,             # Total CPUs in partition
    "TotalNodes": int | None,            # Total nodes in partition

    # TRES Allocations (extracted from TRES field)
    "TresCPU": int,                      # CPU allocation
    "TresMEM": int,                      # Memory allocation (bytes)
    "TresNODE": int,                     # Node allocation
    "TresBILLING": int,                  # Billing units
    "TresGRESGPU": int,                  # GPU allocation

    # TRES Billing Weights (from TRESBillingWeights)
    "TresBillingWeightCPU": int,         # CPU billing weight
    "TresBillingWeightMEM": int,         # Memory billing weight
    "TresBillingWeightGRESGPU": int,     # GPU billing weight

    # Priority Configuration
    "PriorityJobFactor": int | None,     # Job priority factor
    "PriorityTier": int | None,          # Partition priority tier
    "QoS": str,                          # Allowed QoS list
    "PreemptMode": str,                  # Preemption mode (OFF, CANCEL, etc.)
}

Command-Line Options

Option	Type	Default	Description
`--sink`	String	Required	Sink destination, see Exporters
`--sink-opts`	Multiple	-	Sink-specific configuration
`--stdout`	Flag	False	Display metrics to stdout in addition to logs
`--interval`	Integer	86400	Frequency in seconds to collect data (24 hours)
`--once`	Flag	False	Run once and exit
`--cluster`	String	Auto-detected	Cluster identifier
`--heterogeneous-cluster-v1`	Flag	False	Enable heterogeneous cluster support
`--log-level`	Choice	INFO	DEBUG, INFO, WARNING, ERROR, CRITICAL
`--log-folder`	String	`/var/log/fb-monitoring`	Log directory
`--retries`	Integer	3	Number of retries on failure
`--dry-run`	Flag	False	Run without sending data to sink
`--chunk-size`	Integer	1000	The maximum size in bytes of each chunk when writing data to sink

Usage Examples

Basic Daily Collection

gcm scontrol --sink otel --sink-opts "log_resource_attributes={'attr_1': 'value1'}"

One-Time Snapshot

gcm scontrol --once --sink stdout

Hourly Collection

# Monitor partition changes more frequently
gcm scontrol --interval 3600 --sink file --sink-opts filepath=/tmp/partitions.json

Overview​

Execution Scope​

Output Schema​

Scontrol (Per Partition)​

Command-Line Options​

Usage Examples​

Basic Daily Collection​

One-Time Snapshot​

Hourly Collection​