clock_policy

Overview

Validates GPU application clock compliance against a configured policy using NVML telemetry. For each GPU, the check compares observed graphics and memory application clocks against expected values and classifies drift as OK, WARN, or CRITICAL.

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --check clock_policy | Choice | Required | Enable the clock policy sub-check under check-nvidia-smi |
| --expected-graphics-freq | Integer | 1155 | Expected graphics application clock (MHz) |
| --expected-memory-freq | Integer | 1593 | Expected memory application clock (MHz) |
| --warn-delta-mhz | Integer | 30 | Warn when absolute drift meets or exceeds this threshold |
| --critical-delta-mhz | Integer | 75 | Critical when absolute drift meets or exceeds this threshold |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |
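
The threshold options above can be read as a simple per-GPU rule. The following is a minimal sketch of that rule, not the check's actual implementation: the names `Severity` and `classify_drift` are hypothetical, and it assumes the larger of the graphics and memory drifts decides the result, which the documentation does not state explicitly.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels, ordered so that max() picks the worst."""
    OK = 0
    WARN = 1
    CRITICAL = 2


def classify_drift(
    observed_graphics_mhz: int,
    observed_memory_mhz: int,
    expected_graphics_mhz: int = 1155,  # default of --expected-graphics-freq
    expected_memory_mhz: int = 1593,    # default of --expected-memory-freq
    warn_delta_mhz: int = 30,           # default of --warn-delta-mhz
    critical_delta_mhz: int = 75,       # default of --critical-delta-mhz
) -> Severity:
    """Classify one GPU by the absolute drift of its application clocks."""
    # Assumption: the worse of the two drifts determines the GPU's severity.
    drift = max(
        abs(observed_graphics_mhz - expected_graphics_mhz),
        abs(observed_memory_mhz - expected_memory_mhz),
    )
    # "Meets or exceeds" in the option descriptions maps to >=.
    if drift >= critical_delta_mhz:
        return Severity.CRITICAL
    if drift >= warn_delta_mhz:
        return Severity.WARN
    return Severity.OK
```

With the default thresholds, a graphics clock of 1125 MHz (30 MHz below the expected 1155 MHz) would be classified as WARN under this sketch, while 1126 MHz would still be OK.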

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Check disabled via feature flag (killswitch active) |
| OK (0) | All GPUs are within thresholds |
| WARN (1) | At least one GPU's drift meets or exceeds the warn threshold |
| WARN (1) | No GPUs detected |
| CRITICAL (2) | At least one GPU's drift meets or exceeds the critical threshold |
| UNKNOWN (3) | Initialization or telemetry flow failed before final classification |
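
To illustrate how these conditions could combine into a single exit code, here is a hedged sketch. The function `final_exit_code` and its parameters are hypothetical and not part of the tool; the sketch assumes the worst per-GPU result wins and that any exception raised while collecting results maps to UNKNOWN.

```python
from typing import Callable, Sequence

# Exit codes as listed in the table above.
OK, WARN, CRITICAL, UNKNOWN = 0, 1, 2, 3


def final_exit_code(
    collect_severities: Callable[[], Sequence[int]],
    killswitch_active: bool = False,
) -> int:
    """Reduce per-GPU severities (OK, WARN, or CRITICAL) to one exit code.

    `collect_severities` is a hypothetical callable standing in for device
    initialization plus telemetry collection; it returns one severity per GPU.
    """
    if killswitch_active:
        # Check disabled via feature flag: report OK without querying GPUs.
        return OK
    try:
        severities = collect_severities()
    except Exception:
        # Assumption: any failure before classification is reported as UNKNOWN.
        return UNKNOWN
    if not severities:
        # No GPUs detected.
        return WARN
    # Assumption: the worst per-GPU severity decides the overall result.
    return max(severities)
```

For example, `final_exit_code(lambda: [OK, WARN, OK])` returns `WARN` (1), and an empty result list also returns `WARN`, matching the "No GPUs detected" row.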

Feature Flag (Killswitch)

Use the health checks features config to disable this check:

[HealthChecksFeatures]
disable_nvidia_smi_clock_policy = true

Usage Examples

Basic Policy Validation

health_checks check-nvidia-smi \
--check clock_policy \
--expected-graphics-freq 1155 \
--expected-memory-freq 1593 \
--warn-delta-mhz 30 \
--critical-delta-mhz 75 \
--sink do_nothing \
[CLUSTER] \
app

With Telemetry Sink

health_checks check-nvidia-smi \
--check clock_policy \
--expected-graphics-freq 1155 \
--expected-memory-freq 1593 \
--warn-delta-mhz 30 \
--critical-delta-mhz 75 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app