sacct_backfill

Overview

Orchestrates historical SLURM job data collection by partitioning large time ranges into manageable chunks and systematically backfilling them through sacct_wrapper and sacct_publish. Supports parallel processing, retry of failed chunks, and rendezvous synchronization for multi-cluster coordination when writing to immutable storage.

Data Type: N/A (orchestrator that invokes other collectors)
Schemas: N/A (orchestrator)

Execution Scope

Single node in the cluster.

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --cluster | String | Auto-detected | Cluster name for metadata enrichment |
| --sacct-timeout | Integer | 120 | Timeout in seconds for each sacct call |
| --publish-timeout | Integer | 120 | Timeout in seconds for each sacct_publish call |
| --concurrently | Integer | 1 | Maximum number of publishes that can occur concurrently (0 = unlimited) |
| --log-level | Choice | INFO | One of DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --stdout | Flag | False | Display metrics to stdout in addition to logs |
| --heterogeneous-cluster-v1 | Flag | False | Enable per-partition metrics for heterogeneous clusters |
| --interval | Integer | 300 | Seconds between collection cycles (5 minutes) |
| --once | Flag | False | Run once and exit (no continuous monitoring) |
| --retries | Integer | Shared default | Retry attempts on sink failures |
| --dry-run | Flag | False | Print to stdout instead of publishing to sink |
| --chunk-size | Integer | Shared default | Maximum size in bytes of each chunk when writing data to the sink |
| --sleep | Integer | 10 | Seconds to wait between chunks (serial mode only) |
| --rendezvous-host | IP Address | None | Host running the rendezvous server; synchronizes backfill processes across multiple clusters (see sacct_backfill_server) |
| --rendezvous-port | Integer | 50000 | Port of the rendezvous server |
| --authkey | UUID | Required when --rendezvous-host is set | Authentication key from the server |
| --rendezvous-timeout | Integer | 60 | Seconds to wait for synchronization |
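
Before a long run, --dry-run is useful for validating the configuration, since it prints each chunk to stdout instead of publishing. A minimal sketch combining the documented flags above (the time range here is arbitrary):

# Validate options without writing to the sink
gcm sacct_backfill --dry-run --log-level DEBUG --once new \
  -s "6 hours ago" \
  -e "now" \
  -- \
  gcm sacct_publish --sink stdout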

Subcommands

new - Start New Backfill

Partition a time range and backfill all chunks.

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -s, --start | String | 3 hours ago | Start time (parsed by GNU date -d) |
| -e, --end | String | now | End time (parsed by GNU date -d) |
| --step | Integer | 1 | Chunk size in hours |
| PUBLISH_CMD | Arguments | Required | Command to publish each chunk (after --); see sacct_publish |
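
For intuition, the partitioning that new performs can be approximated with GNU date arithmetic. This is an illustrative sketch only: it assumes half-open [start, start + step) chunks with the final chunk clamped to the end time, and the tool's actual boundary handling may differ.

# Rough sketch of how a time range is split by --step (assumed semantics)
start=$(date -d "6 hours ago" +%s)
end=$(date -d "now" +%s)
step=$((2 * 3600))   # --step 2, converted to seconds
while [ "$start" -lt "$end" ]; do
  chunk_end=$((start + step))
  [ "$chunk_end" -gt "$end" ] && chunk_end=$end
  echo "chunk: $(date -d "@$start" -Iseconds) .. $(date -d "@$chunk_end" -Iseconds)"
  start=$chunk_end
done

With this range and --step 2, the loop prints three 2-hour chunks, each of which sacct_backfill would hand to the publish command.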

from_file - Retry Failed Chunks

Backfill specific time intervals from a CSV file. Used to retry chunks that failed in previous runs.

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --intervals | File | stdin | CSV file with (start, end) pairs |
| PUBLISH_CMD | Arguments | Required | Command to publish each chunk (after --); see sacct_publish |
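
The documentation above does not pin down the CSV timestamp format; the sketch below assumes one (start, end) pair per row, using timestamps GNU date -d can parse:

# intervals.csv (illustrative; the exact timestamp format is an assumption)
2024-01-01 00:00,2024-01-01 02:00
2024-01-01 02:00,2024-01-01 04:00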

Usage Examples

Basic Backfill

# Last 24 hours, one chunk at a time
gcm sacct_backfill --once new \
  -s "1 day ago" \
  -e "now" \
  -- \
  gcm sacct_publish --sink stdout

Large Historical Backfill

# 1 year of data in 2-hour chunks, 5 concurrent
gcm sacct_backfill --once new \
  -s "jan 1 2023" \
  -e "dec 31 2023" \
  --step 2 \
  --concurrently 5 \
  -- \
  gcm sacct_publish \
  --sink otel \
  -o "log_resource_attributes={'key': 'val'}" \
  2> backfill_errors.log

Continuous Backfill

# Run every hour, backfill last 3 hours
gcm sacct_backfill new \
  -s "3 hours ago" \
  -e "now" \
  --interval 3600 \
  -- \
  gcm sacct_publish --sink otel
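
Retry Failed Chunks

If some chunks failed in an earlier run, from_file can replay them. This sketch assumes the failed (start, end) pairs were collected into a CSV file named failed_chunks.csv (hypothetical name; see from_file above for the expected input):

# Replay previously failed intervals from a CSV file
gcm sacct_backfill --once from_file \
  --intervals failed_chunks.csv \
  -- \
  gcm sacct_publish --sink otel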

Multi-Cluster Synchronized Backfill

# Start server (on coordinator node)
gcm sacct_backfill_server --nprocs 3
# Note the authkey output

# On cluster1
gcm sacct_backfill --once new \
  --rendezvous-host coordinator.example.com \
  --rendezvous-port 50000 \
  --authkey <UUID> \
  --cluster cluster1 \
  -s "7 days ago" -e "now" \
  -- \
  gcm sacct_publish --sink graph_api

# On cluster2 and cluster3 (same command with different clusters)
# All three will process chunks in lockstep