Getting Started
GCM GPU Metrics is an OpenTelemetry processor, the Slurm Processor, that enriches telemetry data with Slurm metadata.
Overview
The Slurm Processor is designed to enhance OpenTelemetry telemetry data (traces, metrics, and logs) with Slurm job information. It identifies the Slurm jobs associated with specific GPUs and adds relevant metadata such as job IDs, user names, partitions, and other Slurm-specific attributes to the telemetry data.
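The enrichment step described above can be pictured with a minimal Go sketch. It is an illustration only, not the processor's actual code: the `JobInfo` type, the `enrich` function, the `gpu` attribute key, and the `slurm.*` attribute names are all assumptions made for this example, and plain maps stand in for the Collector's telemetry types.

```go
package main

import "fmt"

// JobInfo is a hypothetical subset of the Slurm metadata attached to telemetry.
type JobInfo struct {
	JobID     string
	User      string
	Partition string
}

// enrich returns a copy of attrs with Slurm attributes added for the job
// that owns the GPU named in attrs["gpu"]. The "gpu" key and the "slurm.*"
// attribute names are assumptions for illustration.
func enrich(attrs map[string]string, gpuToJob map[string]JobInfo) map[string]string {
	out := make(map[string]string, len(attrs)+3)
	for k, v := range attrs {
		out[k] = v
	}
	job, ok := gpuToJob[attrs["gpu"]]
	if !ok {
		return out // no known job on this GPU: leave the attributes unchanged
	}
	out["slurm.job_id"] = job.JobID
	out["slurm.user"] = job.User
	out["slurm.partition"] = job.Partition
	return out
}

func main() {
	gpuToJob := map[string]JobInfo{
		"0": {JobID: "12345", User: "alice", Partition: "train"},
	}
	// GPU 0 is owned by a job, so its data point gains Slurm attributes.
	fmt.Println(enrich(map[string]string{"gpu": "0", "metric": "DCGM_FI_DEV_GPU_UTIL"}, gpuToJob))
	// GPU 1 has no job, so its attributes pass through unchanged.
	fmt.Println(enrich(map[string]string{"gpu": "1"}, gpuToJob))
}
```

The sketch copies the attributes instead of mutating them in place so that a data point with no matching job passes through untouched.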
This processor is particularly useful in high-performance computing environments where correlating system telemetry with Slurm job information is important for monitoring.
At Meta, we have found this processor most useful for metrics data.
[Figure: metric exporting from a DCGM sample perspective]
Getting Started
Slurm Processor is a component of the OpenTelemetry Collector. To use it, you'll need to build a custom OpenTelemetry Collector.
- Add the below to the builder-config.yaml file:
...
processors:
  - gomod:
      path/to/gcm/slurmprocessor
...
- Run the following command to build the OpenTelemetry Collector binary:
./ocb --config builder-config.yaml
- Add the below to your collector-config.yaml file:
...
processors:
  slurm:
    # Number of seconds to cache the results of Slurm calls in memory.
    # Affects how much Slurm metadata is misattributed at the beginning and end of a job's lifetime.
    # Low values could overwhelm slurmctld, so be careful.
    cache_duration: 60
    # Path to the file where the processor caches results
    cache_filepath: '/tmp/slurmprocessor_cache.json'
    # Whether or not the processor queries slurmctld
    query_slurmctld: false
...
service:
  pipelines:
    metrics:
      receivers:
        - receiver1
      processors:
        - slurm # <-- Add this line
      exporters:
        - exporter1
...
Dependencies
- OpenTelemetry Collector
- shelper Go package for Slurm metadata retrieval
License
slurmprocessor is licensed under the Apache 2.0 license.