gpu_retired_pages

Overview

Checks that the number of GPU memory pages retired due to ECC errors does not exceed the configured limit. Flags GPUs with excessive memory errors, which may indicate hardware degradation that requires maintenance.

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --gpu_retired_pages_threshold | Integer | 10 | Maximum retired pages count |
| --gpu_num | Integer | 8 | Expected number of GPUs |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
| --- | --- |
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All counts below threshold and no pending |
| CRITICAL (2) | Threshold exceeded or pending > 0 |
| UNKNOWN (3) | NVML initialization failure |
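
The exit conditions above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the check's actual implementation. The names `ExitCode` and `evaluate_retired_pages` are hypothetical, and so are the order of the tests and the choice to treat a count equal to the threshold as acceptable; the table does not settle either point.

```python
from enum import IntEnum
from typing import Sequence


class ExitCode(IntEnum):
    """Exit codes listed in the Exit Conditions table."""
    OK = 0
    CRITICAL = 2
    UNKNOWN = 3


def evaluate_retired_pages(
    retired_counts: Sequence[int],
    pending_counts: Sequence[int],
    threshold: int = 10,
    killswitch_active: bool = False,
    nvml_ok: bool = True,
) -> ExitCode:
    """Map per-GPU retired and pending page counts to an exit code.

    Hypothetical helper: the parameter names are illustrative only.
    """
    # Killswitch active means the feature flag is disabled: report OK
    # without evaluating anything (assumed to take priority).
    if killswitch_active:
        return ExitCode.OK
    # Without NVML there is no data to judge, so the result is UNKNOWN.
    if not nvml_ok:
        return ExitCode.UNKNOWN
    # Assumption: "exceeded" means strictly greater than the threshold.
    if any(count > threshold for count in retired_counts):
        return ExitCode.CRITICAL
    # Any page still pending retirement is also treated as CRITICAL.
    if any(pending > 0 for pending in pending_counts):
        return ExitCode.CRITICAL
    return ExitCode.OK
```

For example, `evaluate_retired_pages([0, 3], [0, 0])` returns `ExitCode.OK` with the default threshold of 10, while a single GPU with one pending page yields `ExitCode.CRITICAL` regardless of its retired count.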

Usage Examples

gpu_retired_pages - Basic Check

health_checks check-nvidia-smi \
-c gpu_retired_pages \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

gpu_retired_pages - Custom Threshold

health_checks check-nvidia-smi \
-c gpu_retired_pages \
--gpu_retired_pages_threshold 5 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app
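
When the check is launched from a script rather than typed by hand, the same command line can be assembled programmatically and its exit code mapped back to the labels in Exit Conditions. The sketch below is one possible wrapper, assuming `health_checks` is on the `PATH`. `build_command`, `run_check`, and `EXIT_LABELS` are hypothetical names, and `extra_arg` merely reproduces the final `app` argument from the examples, whose meaning this page does not state.

```python
import subprocess
from typing import List, Optional

# Labels for the exit codes documented in Exit Conditions.
EXIT_LABELS = {0: "OK", 2: "CRITICAL", 3: "UNKNOWN"}


def build_command(
    cluster: str,
    extra_arg: str = "app",
    threshold: Optional[int] = None,
    sink: str = "otel",
    sink_opts: Optional[str] = None,
) -> List[str]:
    """Build the argument list in the same order as the usage examples."""
    cmd = ["health_checks", "check-nvidia-smi", "-c", "gpu_retired_pages"]
    # Only pass the threshold when overriding the default.
    if threshold is not None:
        cmd += ["--gpu_retired_pages_threshold", str(threshold)]
    cmd += ["--sink", sink]
    if sink_opts is not None:
        cmd += ["--sink-opts", sink_opts]
    # Positional arguments come last, as in the examples.
    cmd += [cluster, extra_arg]
    return cmd


def run_check(cmd: List[str]) -> str:
    """Run the command and translate its exit code into a label."""
    # check=False: a non-zero exit code is an expected result, not an error.
    result = subprocess.run(cmd, check=False)
    return EXIT_LABELS.get(result.returncode, f"unexpected ({result.returncode})")
```

Passing the arguments as a list avoids shell quoting issues with values such as `log_resource_attributes={'attr_1': 'value1'}`, which would otherwise need the double quotes shown in the examples.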