sacctmgr_qos

Overview

Collects Quality of Service (QoS) configuration data from SLURM using sacctmgr and publishes it at regular intervals. Provides daily snapshots of QoS resource limits (CPU, memory, GPU, wall time), priorities and preemption settings, usage limits per user/group, grace periods and runtime constraints, and QoS hierarchies. Enables tracking of QoS configuration changes over time.

Data Type: DataType.LOG, Schema: SacctmgrQosPayload

Execution Scope

Single node in the cluster (typically head node).

Output Schema

SacctmgrQosPayload

Published with DataType.LOG:

{
    "ds": str,  # Collection date (YYYY-MM-DD in Pacific time)
    "cluster": str,  # Cluster identifier
    "derived_cluster": str,  # Sub-cluster (same as cluster unless `--heterogeneous-cluster-v1` is set)
    "sacctmgr_qos": {  # Dictionary of QoS attributes
        "Name": str,  # QoS name
        "Priority": str,  # Job priority
        "GraceTime": str,  # Grace period before preemption
        "Preempt": str,  # QoS names that this QoS can preempt
        "PreemptExemptTime": str,  # Minimum run time before a job can be preempted
        "PreemptMode": str,  # Preemption mode (cancel, requeue, suspend)
        "Flags": str,  # QoS flags
        "UsageThres": str,  # Usage threshold
        "UsageFactor": str,  # Usage factor for fair-share
        "GrpTRES": str,  # Group TRES limits
        "GrpTRESMins": str,  # Group TRES-minutes limits
        "GrpTRESRunMins": str,  # Group running TRES-minutes limits
        "GrpJobs": str,  # Max concurrent jobs per group
        "GrpSubmit": str,  # Max submitted jobs per group
        "GrpWall": str,  # Max wall time per group
        "MaxTRES": str,  # Max TRES per job
        "MaxTRESMins": str,  # Max TRES-minutes per job
        "MaxTRESPerNode": str,  # Max TRES per node
        "MaxJobs": str,  # Max concurrent jobs per user
        "MaxSubmit": str,  # Max submitted jobs per user
        "MaxWall": str,  # Max wall time per job
        "MinTRES": str,  # Min TRES per job
        # Additional fields depending on SLURM version
    }
}
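As a rough illustration of the shape above, the sketch below assembles one payload per QoS row. The helper names `pacific_ds` and `build_payload` are hypothetical and not part of the collector; the only facts taken from this page are the field names, the Pacific-time `ds` format, and the rule that `derived_cluster` falls back to `cluster`.

```python
from datetime import datetime
from typing import Dict, Optional
from zoneinfo import ZoneInfo


def pacific_ds(now: Optional[datetime] = None) -> str:
    """Return the collection date as YYYY-MM-DD in Pacific time.

    `now` should be timezone-aware; if omitted, the current time is used.
    (Hypothetical helper, not the collector's own function.)
    """
    pacific = ZoneInfo("America/Los_Angeles")
    now = now.astimezone(pacific) if now is not None else datetime.now(pacific)
    return now.strftime("%Y-%m-%d")


def build_payload(
    cluster: str,
    qos_row: Dict[str, str],
    ds: str,
    derived_cluster: Optional[str] = None,
) -> dict:
    """Assemble a dict shaped like SacctmgrQosPayload for a single QoS row."""
    return {
        "ds": ds,
        "cluster": cluster,
        # Without a sub-cluster, derived_cluster is the same as cluster.
        "derived_cluster": derived_cluster or cluster,
        # All QoS attribute values stay as strings, as in the schema.
        "sacctmgr_qos": dict(qos_row),
    }
```

A call might look like `build_payload("my_cluster", {"Name": "normal", "Priority": "10"}, ds=pacific_ds())`, where `my_cluster` is a placeholder name.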

Note: The exact fields in sacctmgr_qos dictionary depend on the SLURM version. The collector dynamically parses the header line to determine available fields.
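Header-driven parsing of this kind can be sketched as follows. This is a minimal illustration, not the collector's actual code: it assumes pipe-delimited text such as that produced by `sacctmgr show qos --parsable2`, and the function name `parse_qos_table` is hypothetical.

```python
from typing import Dict, List


def parse_qos_table(text: str, delimiter: str = "|") -> List[Dict[str, str]]:
    """Turn delimited sacctmgr-style output into one dict per QoS row.

    The first non-empty line is treated as the header, so any field the
    installed SLURM version emits is picked up without a hard-coded list.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split(delimiter)
    rows = []
    for line in lines[1:]:
        values = line.split(delimiter)
        # Pad short rows so every header field gets a (possibly empty) value.
        values += [""] * (len(header) - len(values))
        rows.append(dict(zip(header, values)))
    return rows


# Illustrative input only; real output has many more columns.
sample = "Name|Priority|MaxWall\nnormal|10|1-00:00:00\nhigh|100|\n"
rows = parse_qos_table(sample)
```

Because the keys come from the header line, a field added in a newer SLURM release appears in each row dict without any code change.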

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --cluster | String | Auto-detected | Cluster name for metadata enrichment |
| --sink | String | Required | Sink destination, see Exporters |
| --sink-opts | Multiple | - | Sink-specific options |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --stdout | Flag | False | Display metrics to stdout in addition to logs |
| --heterogeneous-cluster-v1 | Flag | False | Enable per-partition metrics for heterogeneous clusters |
| --interval | Integer | 300 | Seconds between collection cycles (5 minutes) |
| --once | Flag | False | Run once and exit (no continuous monitoring) |
| --retries | Integer | Shared default | Retry attempts on sink failures |
| --dry-run | Flag | False | Print to stdout instead of publishing to sink |
| --chunk-size | Integer | Shared default | Maximum size in bytes of each chunk when writing data to sink |

Usage Examples

Basic Daily Collection

gcm sacctmgr_qos --sink otel --sink-opts "log_resource_attributes={'attr_1': 'value1'}"

One-Time Snapshot

gcm sacctmgr_qos --once --sink stdout

Custom Collection Interval

# Collect every 6 hours
gcm sacctmgr_qos --interval 21600 --sink graph_api

File Output

gcm sacctmgr_qos \
    --once \
    --sink file \
    --sink-opts filepath=/tmp/qos_data.jsonl
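
To follow QoS configuration changes over time, two such files can be compared. The sketch below assumes, based only on the `.jsonl` extension, that the file sink writes one JSON object per line in the SacctmgrQosPayload shape; that layout and the helper names `load_qos_snapshot` and `diff_qos` are assumptions, not documented behavior.

```python
import json
from typing import Dict, Optional, Tuple

QosSnapshot = Dict[str, Dict[str, str]]


def load_qos_snapshot(path: str) -> QosSnapshot:
    """Read a JSONL file and index the QoS attribute dicts by QoS name."""
    snapshot: QosSnapshot = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            qos = json.loads(line)["sacctmgr_qos"]
            snapshot[qos["Name"]] = qos
    return snapshot


def diff_qos(
    old: QosSnapshot, new: QosSnapshot
) -> Dict[str, Dict[str, Tuple[Optional[str], Optional[str]]]]:
    """Return {qos_name: {field: (old_value, new_value)}} for changed fields.

    Only QoS names present in both snapshots are compared.
    """
    changes = {}
    for name in old.keys() & new.keys():
        fields = {
            key: (old[name].get(key), new[name].get(key))
            for key in old[name].keys() | new[name].keys()
            if old[name].get(key) != new[name].get(key)
        }
        if fields:
            changes[name] = fields
    return changes
```

Indexing by `Name` alone assumes each file holds data for a single cluster; a file covering several clusters would need a key that includes `cluster` as well.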