gpu_temperature

Overview

Verifies that each GPU's temperature, read with nvmlDeviceGetTemperature(), stays below the threshold set by --gpu_temperature_threshold. Catching an overheating GPU early helps avoid thermal throttling and hardware damage during workload execution.
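
As a rough illustration of the NVML read described above, the sketch below collects one temperature per GPU. It is not the check's actual implementation: the helper name read_gpu_temperatures is hypothetical, and the NVML bindings are passed in as an argument (assumed here to be the pynvml module from the nvidia-ml-py package) so the function can also be exercised without a GPU.

```python
from typing import Any, List


def read_gpu_temperatures(nvml: Any) -> List[int]:
    """Return the core temperature (°C) of every GPU visible to NVML.

    `nvml` is assumed to expose the pynvml API (nvmlInit, nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex, nvmlDeviceGetTemperature, nvmlShutdown).
    """
    nvml.nvmlInit()
    try:
        temperatures = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            # NVML_TEMPERATURE_GPU selects the on-die GPU temperature sensor.
            temperatures.append(
                nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            )
        return temperatures
    finally:
        # Release NVML even if one of the queries above raised.
        nvml.nvmlShutdown()


# Usage on a host with an NVIDIA driver and nvidia-ml-py installed:
#   import pynvml
#   print(read_gpu_temperatures(pynvml))
```

Passing the bindings in rather than importing them at module level keeps the sketch importable on machines without NVML.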

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --gpu_temperature_threshold | Integer | Required | Maximum GPU temperature (°C) |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
| --- | --- |
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All GPUs below threshold |
| CRITICAL (2) | Any GPU exceeds threshold |
| UNKNOWN (3) | NVML initialization failure |
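
The exit conditions above can be sketched as a small decision function. This is an illustrative assumption, not the tool's source: the names evaluate_gpu_temperature and check_enabled are hypothetical, a failed NVML initialization is represented by passing None, and a temperature exactly equal to the threshold is treated as OK because the table only says "exceeds".

```python
from typing import Optional, Sequence

# Exit codes as listed in the Exit Conditions table.
OK = 0
CRITICAL = 2
UNKNOWN = 3


def evaluate_gpu_temperature(
    temperatures: Optional[Sequence[int]],
    threshold: int,
    check_enabled: bool = True,
) -> int:
    """Map per-GPU temperatures (°C) to an exit code.

    `temperatures` is None when NVML could not be initialized (assumption).
    `check_enabled` stands in for the feature flag / killswitch (hypothetical).
    """
    if not check_enabled:
        # Killswitch active: the check is skipped and reports OK.
        return OK
    if temperatures is None:
        return UNKNOWN
    if any(temp > threshold for temp in temperatures):
        return CRITICAL
    return OK
```

For example, evaluate_gpu_temperature([70, 81], threshold=80) returns CRITICAL (2) because one GPU is above 80 °C, while evaluate_gpu_temperature([70, 79], threshold=80) returns OK (0).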

Usage Examples

gpu_temperature - Basic Check

health_checks check-nvidia-smi \
-c gpu_temperature \
--gpu_temperature_threshold 80 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

gpu_temperature - Custom Threshold

health_checks check-nvidia-smi \
-c gpu_temperature \
--gpu_temperature_threshold 85 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app