sacctmgr_qos

Overview

Collects Quality of Service (QoS) configuration data from SLURM using sacctmgr and publishes it at a regular interval. Provides daily snapshots of QoS resource limits (CPU, memory, GPU, wall time), priorities and preemption settings, per-user and per-group usage limits, grace periods and runtime constraints, and QoS hierarchies, so that QoS configuration changes can be tracked over time.

Data Type: DataType.LOG, Schema: SacctmgrQosPayload

Execution Scope

Single node in the cluster (typically head node).

Output Schema

SacctmgrQosPayload

Published with DataType.LOG:

{
    "ds": str,                    # Collection date (YYYY-MM-DD in Pacific time)
    "cluster": str,               # Cluster identifier
    "derived_cluster": str,       # Sub-cluster (same as cluster unless `--heterogeneous-cluster-v1` is set)
    "sacctmgr_qos": {             # Dictionary of QoS attributes
        "Name": str,              # QoS name
        "Priority": str,          # Job priority
        "GraceTime": str,         # Grace period before preemption
        "Preempt": str,           # QoS levels that this QoS can preempt
        "PreemptExemptTime": str, # Time before job can be preempted
        "PreemptMode": str,       # Preemption mode (cancel, requeue, suspend)
        "Flags": str,             # QoS flags
        "UsageThres": str,        # Usage threshold
        "UsageFactor": str,       # Usage factor for fair-share
        "GrpTRES": str,           # Group TRES limits
        "GrpTRESMins": str,       # Group TRES-minutes limits
        "GrpTRESRunMins": str,    # Group running TRES-minutes limits
        "GrpJobs": str,           # Max concurrent jobs per group
        "GrpSubmit": str,         # Max submitted jobs per group
        "GrpWall": str,           # Max wall time per group
        "MaxTRES": str,           # Max TRES per job
        "MaxTRESMins": str,       # Max TRES-minutes per job
        "MaxTRESPerNode": str,    # Max TRES per node
        "MaxJobs": str,           # Max concurrent jobs per user
        "MaxSubmit": str,         # Max submitted jobs per user
        "MaxWall": str,           # Max wall time per job
        "MinTRES": str,           # Min TRES per job
        # Additional fields depending on SLURM version
    }
}

Note: The exact fields in sacctmgr_qos dictionary depend on the SLURM version. The collector dynamically parses the header line to determine available fields.
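
The header-driven parsing described in the note can be illustrated with a minimal sketch. It assumes pipe-delimited text with one header line followed by one row per QoS (the shape of sacctmgr's parsable output); the command the collector actually runs and its real parsing code are not shown on this page, and `parse_qos_table` is a hypothetical helper name.

```python
from typing import Dict, List


def parse_qos_table(text: str, delimiter: str = "|") -> List[Dict[str, str]]:
    """Turn header-plus-rows text into one dict per QoS, keyed by header names.

    Hypothetical helper: illustrates header-driven parsing, not the
    collector's actual implementation.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # The first non-empty line names the columns, so no field list is hard-coded
    # and extra columns from newer SLURM versions are picked up automatically.
    header = lines[0].split(delimiter)
    rows = []
    for line in lines[1:]:
        values = line.split(delimiter)
        # zip() stops at the shorter sequence, so a short row omits trailing fields.
        rows.append(dict(zip(header, values)))
    return rows


# Made-up sample input with three columns; real output has many more.
sample = "Name|Priority|MaxWall\nnormal|0|\nhigh|100|1-00:00:00\n"
records = parse_qos_table(sample)
```

Every value stays a string, matching the `str` types in the schema; unset limits come through as empty strings.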

Command-Line Options

Option	Type	Default	Description
`--cluster`	String	Auto-detected	Cluster name for metadata enrichment
`--sink`	String	Required	Sink destination, see Exporters
`--sink-opts`	Multiple	-	Sink-specific options
`--log-level`	Choice	INFO	DEBUG, INFO, WARNING, ERROR, CRITICAL
`--log-folder`	String	`/var/log/fb-monitoring`	Log directory
`--stdout`	Flag	False	Display metrics to stdout in addition to logs
`--heterogeneous-cluster-v1`	Flag	False	Enable per-partition metrics for heterogeneous clusters
`--interval`	Integer	300	Seconds between collection cycles (5 minutes)
`--once`	Flag	False	Run once and exit (no continuous monitoring)
`--retries`	Integer	Shared default	Retry attempts on sink failures
`--dry-run`	Flag	False	Print to stdout instead of publishing to sink
`--chunk-size`	Integer	Shared default	Maximum size in bytes of each chunk written to the sink

Usage Examples

Basic Daily Collection

gcm sacctmgr_qos --sink otel --sink-opts "log_resource_attributes={'attr_1': 'value1'}"

One-Time Snapshot

gcm sacctmgr_qos --once --sink stdout

Custom Collection Interval

# Collect every 6 hours
gcm sacctmgr_qos --interval 21600 --sink graph_api

File Output

gcm sacctmgr_qos \
  --once \
  --sink file \
  --sink-opts filepath=/tmp/qos_data.jsonl
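
To consume the file afterwards, something like the following sketch may help. It assumes the file sink writes one JSON object per line (suggested by the `.jsonl` extension, but not confirmed here) and that each object follows the SacctmgrQosPayload layout; `iter_qos_records` and `index_by_name` are hypothetical helper names, and the sample lines are made up.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_qos_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one decoded payload per non-empty line (assumed JSON Lines)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def index_by_name(records: Iterable[Dict]) -> Dict[str, Dict[str, str]]:
    """Map each QoS name to its attribute dictionary."""
    return {r["sacctmgr_qos"]["Name"]: r["sacctmgr_qos"] for r in records}


# In-memory stand-in for the file contents; with a real file, pass the open
# file object instead, e.g. `with open("/tmp/qos_data.jsonl") as f: ...`.
sample_lines = [
    '{"ds": "2024-01-01", "cluster": "c1", "derived_cluster": "c1", '
    '"sacctmgr_qos": {"Name": "normal", "Priority": "0"}}',
    "",
    '{"ds": "2024-01-01", "cluster": "c1", "derived_cluster": "c1", '
    '"sacctmgr_qos": {"Name": "high", "Priority": "100"}}',
]
by_name = index_by_name(iter_qos_records(sample_lines))
```

Indexing by `Name` makes it straightforward to compare the same QoS across two snapshots taken on different `ds` dates.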
