GCM Collectors Documentation
This directory contains documentation for all GCM monitoring collectors. Collectors are CLI tools that gather metrics and data from various sources (SLURM, GPUs, storage, etc.) and publish them to configured sinks.
Collectors
- nvml_monitor - Collects GPU metrics using the NVIDIA Management Library (NVML)
- sacct_backfill - Backfills historical job data in time-chunked batches
- sacct_backfill_server - Coordination server for multi-cluster backfills
- sacct_publish - Transforms and publishes sacct output to sinks
- sacct_running - Continuously monitors running jobs
- sacct_wrapper - Wrapper for strict time-bounded sacct queries
- sacctmgr_qos - Collects Quality of Service configurations
- sacctmgr_user - Collects user account information and associations
- scontrol - Collects partition configuration
- scontrol_config - Collects cluster-wide configuration
- slurm_job_monitor - Real-time node and job monitoring
- slurm_monitor - Comprehensive cluster-wide metrics aggregation
Common Concepts
Sinks/Exporters
Collectors support pluggable sinks via the --sink and --sink-opts options:
- file: Local file output
- stdout: Console output
- otel: OTLP-compatible backends
Check out Exporters.
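As a rough illustration of what "pluggable" can mean here, the sketch below keeps a registry that maps a sink name to a function and picks one at run time. It is not GCM's actual exporter code: the names SINKS, register_sink, parse_sink_opts and publish, the key=value form of the sink options, and the file_path option key are all assumptions made for this example.

```python
import json
import sys
from typing import Callable, Dict, Iterable, List, Mapping

Record = Mapping[str, object]
SinkFn = Callable[[Iterable[Record], Mapping[str, str]], None]

# Hypothetical registry: sink name -> function that writes records.
SINKS: Dict[str, SinkFn] = {}


def register_sink(name: str) -> Callable[[SinkFn], SinkFn]:
    """Decorator that adds a sink function to the registry under `name`."""
    def decorator(fn: SinkFn) -> SinkFn:
        SINKS[name] = fn
        return fn
    return decorator


@register_sink("stdout")
def stdout_sink(records: Iterable[Record], opts: Mapping[str, str]) -> None:
    # Write one JSON object per line to the console.
    for record in records:
        sys.stdout.write(json.dumps(record) + "\n")


@register_sink("file")
def file_sink(records: Iterable[Record], opts: Mapping[str, str]) -> None:
    # "file_path" is a made-up option key for this sketch.
    with open(opts["file_path"], "a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def parse_sink_opts(pairs: List[str]) -> Dict[str, str]:
    """Turn ["key=value", ...] into a dict (assumed option format)."""
    return dict(pair.split("=", 1) for pair in pairs)


def publish(sink_name: str, sink_opts: List[str], records: Iterable[Record]) -> None:
    """Look up the named sink and hand it the records and parsed options."""
    try:
        sink = SINKS[sink_name]
    except KeyError:
        raise ValueError(f"unknown sink: {sink_name!r}") from None
    sink(records, parse_sink_opts(sink_opts))
```

With this layout, adding a new destination means registering one more function; the code that calls publish() does not change.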
Common CLI Options
All collectors share these standard options:
- --cluster: Cluster identifier
- --sink: Output destination
- --sink-opts: Sink-specific configuration
- --interval: Seconds between collection cycles
- --once: Run once and exit (vs. continuous loop)
- --log-level: Logging verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- --log-folder: Directory for log files
- --dry-run: Test mode without publishing data
- --chunk-size: Maximum size in bytes of each chunk when writing data to the sink
- --retries: Number of retry attempts on failure
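To make the shape of these options concrete, here is a minimal sketch that declares them with Python's standard argparse module. It only mirrors the option names listed above. The parser name, the value types, the default values, and the repeatable KEY=VALUE form of --sink-opts are assumptions for illustration, and the real collectors may define their options with a different library.

```python
import argparse

LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser exposing the shared collector options."""
    parser = argparse.ArgumentParser(prog="example_collector")  # hypothetical name
    parser.add_argument("--cluster", help="Cluster identifier")
    parser.add_argument("--sink", help="Output destination")
    # Assumed form: the option may repeat, each time carrying KEY=VALUE.
    parser.add_argument("--sink-opts", action="append", default=[],
                        metavar="KEY=VALUE", help="Sink-specific configuration")
    # Defaults below are placeholders, not GCM's real defaults.
    parser.add_argument("--interval", type=int, default=60,
                        help="Seconds between collection cycles")
    parser.add_argument("--once", action="store_true",
                        help="Run once and exit")
    parser.add_argument("--log-level", choices=LOG_LEVELS, default="INFO",
                        help="Logging verbosity")
    parser.add_argument("--log-folder", help="Directory for log files")
    parser.add_argument("--dry-run", action="store_true",
                        help="Test mode without publishing data")
    parser.add_argument("--chunk-size", type=int,
                        help="Maximum size in bytes of each chunk")
    parser.add_argument("--retries", type=int, default=0,
                        help="Number of retry attempts on failure")
    return parser
```

Calling build_parser().parse_args(["--cluster", "c1", "--sink", "stdout", "--once"]) yields a namespace with cluster="c1", sink="stdout" and once=True, with the remaining options at their placeholder defaults.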
Data Collection Loop
Most collectors use run_data_collection_loop(), which provides:
- Interval-based scheduling
- Error handling and retries
- Graceful shutdown
- Logging integration
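The four responsibilities above can be pictured with a small stand-in loop. This is not the real run_data_collection_loop(), whose signature is not shown in this document. The function name collection_loop_sketch, its parameters, and the retry policy (retry the whole cycle immediately) are assumptions chosen to keep the example short.

```python
import logging
import threading
from typing import Callable, Iterable, Mapping, Optional

logger = logging.getLogger("collector_sketch")

Record = Mapping[str, object]


def collection_loop_sketch(
    collect: Callable[[], Iterable[Record]],
    publish: Callable[[Iterable[Record]], None],
    *,
    interval: float,
    once: bool = False,
    retries: int = 0,
    stop_event: Optional[threading.Event] = None,
) -> int:
    """Run collect/publish cycles and return the number that succeeded."""
    stop_event = stop_event or threading.Event()
    completed = 0
    while not stop_event.is_set():
        # Error handling and retries: one initial attempt plus `retries` more.
        for attempt in range(retries + 1):
            try:
                publish(collect())
                completed += 1
                break
            except Exception:
                # Logging integration: record the failure with its traceback.
                logger.exception(
                    "cycle failed (attempt %d of %d)", attempt + 1, retries + 1
                )
        if once:
            break
        # Interval-based scheduling and graceful shutdown: wait() returns
        # early as soon as stop_event is set, so exit is not delayed.
        stop_event.wait(interval)
    return completed
```

In a long-running process, a signal handler for SIGINT or SIGTERM could call stop_event.set() to end the loop after the current cycle; that wiring is left out here so the sketch stays runnable from any thread.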
Schema Validation
Data payloads use typed dataclasses for validation:
- DevicePlusJobMetrics, HostMetrics (nvml_monitor)
- Sacct, SacctmgrQosPayload, SacctmgrUserPayload (SLURM accounting)
- Scontrol, ScontrolConfig (SLURM control)
- NodeData, SLURMLog (SLURM monitoring)
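For readers unfamiliar with the pattern, the following sketch shows one way a typed dataclass can reject a malformed payload at construction time. ExampleHostMetrics and its fields are invented for this example and do not reflect the real HostMetrics definition; the runtime check in __post_init__ is likewise an assumption, since plain dataclasses do not enforce their annotations on their own.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class ExampleHostMetrics:
    """Hypothetical payload; field names are illustrative only."""
    hostname: str
    gpu_count: int
    max_gpu_util: float

    def __post_init__(self) -> None:
        # Compare each value against its annotated type. This simple check
        # works only because every annotation here is a plain class.
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"{field.name} must be {field.type.__name__}, "
                    f"got {type(value).__name__}"
                )


metrics = ExampleHostMetrics(hostname="node01", gpu_count=8, max_gpu_util=92.5)
payload = asdict(metrics)  # plain dict, ready to hand to a sink
```

Constructing ExampleHostMetrics("node01", "8", 92.5) raises TypeError because gpu_count is a string, so the bad record is caught before it reaches any sink.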
Adding a New Collector
Check out Adding New Collector.