clock_freq

Overview

Validates that GPU and memory clock frequencies meet the minimum application requirements, reading them with nvmlDeviceGetClockInfo(). This confirms that GPUs are operating at the expected performance levels for workload execution.
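The comparison described above can be sketched in Python. This is a minimal illustration, not the check's actual implementation: it assumes the `pynvml` bindings are installed, that the graphics clock (`NVML_CLOCK_GRAPHICS`) is the one compared against `--gpu_app_freq`, and that a GPU fails when either clock is strictly below its minimum. The function names `read_clocks` and `gpus_below_threshold` are hypothetical; the default values are the ones listed under Command-Line Options.

```python
from typing import Dict, List, Tuple

# Defaults mirror --gpu_app_freq and --gpu_app_mem_freq (MHz).
DEFAULT_MIN_GPU_MHZ = 1155
DEFAULT_MIN_MEM_MHZ = 1593


def read_clocks() -> Dict[int, Tuple[int, int]]:
    """Return {gpu_index: (graphics_mhz, memory_mhz)} read through NVML.

    Hypothetical helper; requires an NVIDIA driver and the pynvml package.
    """
    import pynvml  # imported lazily so the comparison below works without NVML

    pynvml.nvmlInit()
    try:
        clocks = {}
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            # Assumption: graphics clock is the "GPU clock" the check refers to.
            gpu_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
            mem_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
            clocks[index] = (gpu_mhz, mem_mhz)
        return clocks
    finally:
        pynvml.nvmlShutdown()


def gpus_below_threshold(
    clocks: Dict[int, Tuple[int, int]],
    min_gpu_mhz: int = DEFAULT_MIN_GPU_MHZ,
    min_mem_mhz: int = DEFAULT_MIN_MEM_MHZ,
) -> List[int]:
    """Return the sorted indices of GPUs whose GPU or memory clock is below its minimum."""
    return sorted(
        index
        for index, (gpu_mhz, mem_mhz) in clocks.items()
        if gpu_mhz < min_gpu_mhz or mem_mhz < min_mem_mhz
    )
```

On a GPU host, `gpus_below_threshold(read_clocks())` would list the failing devices. Keeping the comparison separate from the NVML calls lets the threshold logic be exercised with plain dictionaries on machines without a GPU.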

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --gpu_app_freq | Integer | 1155 | Minimum GPU clock frequency (MHz) |
| --gpu_app_mem_freq | Integer | 1593 | Minimum memory clock frequency (MHz) |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
| --- | --- |
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All GPUs meet frequency requirements |
| CRITICAL (2) | Any GPU below threshold |
| UNKNOWN (3) | NVML initialization failure |
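The exit conditions above can be read as a small decision function. The sketch below is illustrative only: `ExitCode` and `decide_exit_code` are hypothetical names, the order in which the conditions are evaluated (killswitch first, then NVML failure, then thresholds) is an assumption, and passing `None` for the clock readings stands in for an NVML initialization failure.

```python
from enum import IntEnum
from typing import Dict, Optional, Tuple


class ExitCode(IntEnum):
    """Exit codes listed in the Exit Conditions table."""
    OK = 0
    CRITICAL = 2
    UNKNOWN = 3


def decide_exit_code(
    killswitch_active: bool,
    clocks: Optional[Dict[int, Tuple[int, int]]],
    min_gpu_mhz: int = 1155,
    min_mem_mhz: int = 1593,
) -> ExitCode:
    """Map the check's inputs to an exit code.

    clocks maps gpu_index -> (gpu_mhz, mem_mhz); None models an NVML
    initialization failure (an assumption of this sketch).
    """
    # Feature flag disabled: the check is skipped and reports OK.
    if killswitch_active:
        return ExitCode.OK
    # NVML could not be initialized, so no readings are available.
    if clocks is None:
        return ExitCode.UNKNOWN
    # Any GPU with either clock below its minimum makes the check CRITICAL.
    for gpu_mhz, mem_mhz in clocks.values():
        if gpu_mhz < min_gpu_mhz or mem_mhz < min_mem_mhz:
            return ExitCode.CRITICAL
    return ExitCode.OK
```

For example, `decide_exit_code(False, {0: (1410, 1593), 1: (900, 1593)})` returns `ExitCode.CRITICAL`, because GPU 1 is below the default GPU clock minimum of 1155 MHz.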

Usage Examples

clock_freq - Basic Check

health_checks check-nvidia-smi \
-c clock_freq \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

clock_freq - Custom Frequencies

health_checks check-nvidia-smi \
-c clock_freq \
--gpu_app_freq 1410 \
--gpu_app_mem_freq 1800 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app