gpu_mem_usage

Overview

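Verifies that memory usage on every GPU is below a configurable threshold (in MiB), as reported by nvmlDeviceGetMemoryInfo(). The check is meant to validate cleanup between job executions: memory that is still allocated after a job ends is flagged before the next job starts.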
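The following is a minimal sketch of the kind of query described above, not the check's actual implementation. It assumes the `pynvml` bindings are installed; the helper names (`bytes_to_mib`, `read_used_mib_per_gpu`, `gpus_over_threshold`) are hypothetical, and treating "exceeds" as strictly greater than the threshold is an assumption.

```python
from typing import List

BYTES_PER_MIB = 1024 * 1024


def bytes_to_mib(num_bytes: int) -> float:
    """Convert a byte count (the unit NVML reports) to MiB."""
    return num_bytes / BYTES_PER_MIB


def read_used_mib_per_gpu() -> List[float]:
    """Return used memory in MiB for each visible GPU, queried through NVML."""
    # Imported lazily so the pure helpers in this sketch work without a GPU.
    import pynvml

    pynvml.nvmlInit()
    try:
        used = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # total/free/used, in bytes
            used.append(bytes_to_mib(info.used))
        return used
    finally:
        pynvml.nvmlShutdown()


def gpus_over_threshold(used_mib: List[float], threshold_mib: int = 15) -> List[int]:
    """Return indices of GPUs whose used memory is strictly above the threshold.

    Assumption: a GPU exactly at the threshold is not counted as exceeding it.
    """
    return [i for i, used in enumerate(used_mib) if used > threshold_mib]
```

On a GPU node, `gpus_over_threshold(read_used_mib_per_gpu())` would list the GPUs that still hold more than 15 MiB, the documented default.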

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --gpu_mem_usage_threshold | Integer | 15 | Maximum GPU memory usage (MiB) |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All GPUs below threshold |
| CRITICAL (2) | Any GPU exceeds threshold |
| UNKNOWN (3) | NVML initialization failure |
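The exit conditions above can be read as a small decision function. The sketch below is illustrative only: `ExitCode` and `decide_exit_code` are hypothetical names, an NVML initialization failure is modeled as `None`, and the order in which the conditions are evaluated (killswitch first) is an assumption.

```python
from enum import IntEnum
from typing import Optional, Sequence


class ExitCode(IntEnum):
    OK = 0
    CRITICAL = 2
    UNKNOWN = 3


def decide_exit_code(
    used_mib: Optional[Sequence[float]],
    threshold_mib: int = 15,
    killswitch_active: bool = False,
) -> ExitCode:
    """Map the documented exit conditions to an exit code.

    used_mib is the per-GPU used memory in MiB, or None when NVML could not
    be initialized (an assumption made for this sketch).
    """
    if killswitch_active:
        # Feature flag disabled: the check reports OK without evaluating GPUs.
        return ExitCode.OK
    if used_mib is None:
        return ExitCode.UNKNOWN
    if any(used > threshold_mib for used in used_mib):
        return ExitCode.CRITICAL
    return ExitCode.OK
```

Modeling the codes as an `IntEnum` keeps the numeric values usable directly as process exit statuses, for example `sys.exit(decide_exit_code(...))`.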

Usage Examples

gpu_mem_usage - Basic Check

health_checks check-nvidia-smi \
-c gpu_mem_usage \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

gpu_mem_usage - Epilog Cleanup Validation

health_checks check-nvidia-smi \
-c gpu_mem_usage \
--gpu_mem_usage_threshold 10 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
epilog
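For callers that drive the check from a script (for example a scheduler epilog wrapper), the invocations above could be assembled and run along these lines. This is a sketch under assumptions: `build_command` and `run_check` are hypothetical helpers, `my-cluster` stands in for the `[CLUSTER]` placeholder, and it presumes the process exit status matches the codes in the Exit Conditions table.

```python
import subprocess
from typing import List, Optional


def build_command(
    cluster: str,
    check_type: str,
    threshold_mib: Optional[int] = None,
    sink: Optional[str] = None,
) -> List[str]:
    """Assemble the argv for a gpu_mem_usage check, mirroring the examples above."""
    cmd = ["health_checks", "check-nvidia-smi", "-c", "gpu_mem_usage"]
    if threshold_mib is not None:
        cmd += ["--gpu_mem_usage_threshold", str(threshold_mib)]
    if sink is not None:
        cmd += ["--sink", sink]
    # Positional arguments come last: cluster name, then check type (app, epilog).
    cmd += [cluster, check_type]
    return cmd


def run_check(cmd: List[str], timeout_s: int = 300) -> int:
    """Run the check and return its exit status (0, 2, or 3 per the table above)."""
    return subprocess.run(cmd, timeout=timeout_s, check=False).returncode


if __name__ == "__main__":
    # "my-cluster" is a placeholder cluster name used only for illustration.
    status = run_check(build_command("my-cluster", "epilog", threshold_mib=10, sink="otel"))
    print("gpu_mem_usage exit status:", status)
```

Passing the arguments as a list avoids shell quoting issues, which matters once values such as `--sink-opts` contain quotes or braces.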