GCM Collectors Documentation

This directory contains documentation for all GCM monitoring collectors. Collectors are CLI tools that gather metrics and data from various sources (SLURM, GPUs, storage, etc.) and publish them to configured sinks.

Collectors

Common Concepts

Sinks/Exporters

Collectors support pluggable sinks via the --sink and --sink-opts options:

  • file: Local file output
  • stdout: Console output
  • otel: OTLP-compatible backends

See Exporters for details.
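
To illustrate what "pluggable" can mean here, the following is a minimal sketch of a name-to-class sink registry. It is not GCM's actual implementation: `Sink`, `StdoutSink`, `FileSink`, `SINK_REGISTRY`, `make_sink`, and the `file_path` and `stream` options are all hypothetical names chosen for this example.

```python
import json
import sys
from typing import Any, Dict, Iterable, Protocol


class Sink(Protocol):
    """Hypothetical interface: anything with a write() method is a sink."""

    def write(self, records: Iterable[Dict[str, Any]]) -> None: ...


class StdoutSink:
    """Writes one JSON object per line to a text stream (stdout by default)."""

    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stdout

    def write(self, records):
        for record in records:
            self.stream.write(json.dumps(record, sort_keys=True) + "\n")


class FileSink:
    """Appends one JSON object per line to a local file."""

    def __init__(self, file_path):
        self.file_path = file_path

    def write(self, records):
        with open(self.file_path, "a", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, sort_keys=True) + "\n")


# Hypothetical registry: maps the value given to --sink onto a sink class.
SINK_REGISTRY = {"stdout": StdoutSink, "file": FileSink}


def make_sink(name, **opts):
    """Look up a sink by name and construct it with sink-specific options."""
    try:
        factory = SINK_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown sink: {name!r}") from None
    return factory(**opts)
```

With a registry like this, supporting another backend amounts to adding one class and one dictionary entry, and the collector code only ever calls `write()`.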

Common CLI Options

All collectors share these standard options:

  • --cluster: Cluster identifier
  • --sink: Output destination
  • --sink-opts: Sink-specific configuration
  • --interval: Seconds between collection cycles
  • --once: Run once and exit (vs. continuous loop)
  • --log-level: Logging verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  • --log-folder: Directory for log files
  • --dry-run: Test mode without publishing data
  • --chunk-size: Maximum size in bytes of each chunk written to the sink
  • --retries: Number of retry attempts on failure
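
As a rough illustration of how these shared options fit together, here is a sketch of an equivalent parser built with the standard library's `argparse`. The option names come from the list above. Everything else is an assumption made for the example: the program name, the defaults, the repeatable `KEY=VALUE` form of `--sink-opts`, and the helper `parse_sink_opts`. The real collectors may define their options with a different library and different defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the shared options (defaults are illustrative only)."""
    parser = argparse.ArgumentParser(prog="example-collector")
    parser.add_argument("--cluster", help="Cluster identifier")
    parser.add_argument("--sink", default="stdout", help="Output destination")
    parser.add_argument(
        "--sink-opts",
        action="append",
        default=[],
        metavar="KEY=VALUE",
        help="Sink-specific configuration (assumed repeatable KEY=VALUE pairs)",
    )
    parser.add_argument("--interval", type=int, default=60,
                        help="Seconds between collection cycles")
    parser.add_argument("--once", action="store_true",
                        help="Run once and exit")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    parser.add_argument("--log-folder", help="Directory for log files")
    parser.add_argument("--dry-run", action="store_true",
                        help="Test mode without publishing data")
    parser.add_argument("--chunk-size", type=int,
                        help="Maximum size in bytes of each chunk")
    parser.add_argument("--retries", type=int, default=0,
                        help="Number of retry attempts on failure")
    return parser


def parse_sink_opts(pairs):
    """Turn ["a=1", "b=2"] into {"a": "1", "b": "2"}."""
    opts = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        opts[key] = value
    return opts
```

For example, `build_parser().parse_args(["--cluster", "demo", "--once"])` yields a namespace with `cluster="demo"`, `once=True`, and the illustrative defaults for the remaining options.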

Data Collection Loop

Most collectors use run_data_collection_loop(), which provides:

  • Interval-based scheduling
  • Error handling and retries
  • Graceful shutdown
  • Logging integration
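
The four responsibilities above can be sketched in a few lines. This is not the source of run_data_collection_loop(), whose signature is not shown here. The function `collection_loop_sketch`, its parameters, and `install_signal_handlers` are hypothetical and only show one plausible way to combine scheduling, retries, shutdown, and logging.

```python
import logging
import signal
import threading
import time

logger = logging.getLogger("collector_sketch")


def install_signal_handlers(stop_event):
    """Request a graceful stop on SIGINT/SIGTERM (call from the main thread)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, lambda signum, frame: stop_event.set())


def collection_loop_sketch(collect, publish, *, interval, once=False,
                           retries=0, stop_event=None):
    """Run collect() then publish() every `interval` seconds.

    Returns the number of cycles attempted. A failed cycle is retried up to
    `retries` extra times, then logged and skipped.
    """
    if stop_event is None:
        stop_event = threading.Event()
    cycles = 0
    while not stop_event.is_set():
        started = time.monotonic()
        for attempt in range(retries + 1):
            try:
                publish(collect())
                break  # success: stop retrying
            except Exception:
                logger.exception("attempt %d of %d failed",
                                 attempt + 1, retries + 1)
        cycles += 1
        if once:
            break
        # Sleep for the rest of the interval, waking early on shutdown.
        remaining = max(0.0, interval - (time.monotonic() - started))
        stop_event.wait(remaining)
    return cycles
```

Waiting on a `threading.Event` instead of calling `time.sleep()` lets a signal handler end the loop promptly without interrupting a cycle that is already in progress.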

Schema Validation

Data payloads use typed dataclasses for validation:

  • DevicePlusJobMetrics, HostMetrics (nvml_monitor)
  • Sacct, SacctmgrQosPayload, SacctmgrUserPayload (SLURM accounting)
  • Scontrol, ScontrolConfig (SLURM control)
  • NodeData, SLURMLog (SLURM monitoring)
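
As an illustration of validating a payload with a typed dataclass, the sketch below defines a hypothetical `ExampleHostMetrics` class. Its fields and checks are invented for this example and do not reflect the actual definitions of HostMetrics or the other payload classes listed above.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class ExampleHostMetrics:
    """Hypothetical payload; field names are illustrative only."""

    hostname: str
    gpu_util: int    # percent, 0 to 100
    ram_util: float  # fraction, 0.0 to 1.0

    def __post_init__(self):
        # Type check: works because every annotation here is a plain class.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )
        # Range checks on the values themselves.
        if not 0 <= self.gpu_util <= 100:
            raise ValueError("gpu_util must be between 0 and 100")
        if not 0.0 <= self.ram_util <= 1.0:
            raise ValueError("ram_util must be between 0.0 and 1.0")


def to_record(metrics):
    """Convert a validated payload into a plain dict ready for a sink."""
    return asdict(metrics)
```

Because validation runs in `__post_init__`, a malformed payload fails at construction time, before anything is handed to a sink.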

Adding a New Collector

See the Adding New Collector guide.