row_remap_failed

Overview

Ensures no failed row remap operations exist. Failed remaps indicate critical hardware issues requiring GPU replacement.
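
As a rough illustration of what "failed remap" means in practice, the sketch below parses row-remapper output from `nvidia-smi`. This page does not say which query the check actually runs, so the command in `QUERY_CMD`, the helper name `gpus_with_failed_remaps`, and the assumption that a failure is reported as `1` (or `yes`) are all illustrative rather than a description of the real implementation.

```python
import csv
import io

# Assumed query (not confirmed by this page): one CSV line per GPU with
# its bus ID and the row-remap failure field.
QUERY_CMD = [
    "nvidia-smi",
    "--query-remapped-rows=gpu_bus_id,remapped_rows.failure",
    "--format=csv,noheader",
]


def gpus_with_failed_remaps(csv_text):
    """Return the bus IDs of GPUs whose failure field is set.

    Hypothetical helper: expects the text produced by QUERY_CMD.
    """
    failed = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        bus_id = row[0].strip()
        failure = row[1].strip().lower()
        # Assumption: a failed remap is reported as "1" (or "yes").
        # Any other value, including "[N/A]", is treated as not failed.
        if failure in ("1", "yes"):
            failed.append(bus_id)
    return failed
```

The output of `QUERY_CMD` could be captured with `subprocess.run(QUERY_CMD, capture_output=True, text=True)` and its `stdout` passed to the helper; a non-empty result would correspond to the failure case described above.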

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No failed remaps |
| CRITICAL (2) | Failed remaps detected |
| UNKNOWN (3) | Unable to query remap status |
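
The exit conditions above can be read as a small decision function. The following is a minimal sketch, not the check's actual code: the names `ExitCode` and `row_remap_exit_code` are hypothetical, and evaluating the killswitch before the query result is an assumption about ordering.

```python
from enum import IntEnum
from typing import Optional


class ExitCode(IntEnum):
    # Values taken from the exit conditions listed on this page.
    OK = 0
    CRITICAL = 2
    UNKNOWN = 3


def row_remap_exit_code(killswitch_active: bool,
                        failed_count: Optional[int]) -> ExitCode:
    """Map the check's inputs to an exit code.

    failed_count is None when the remap status could not be queried.
    """
    if killswitch_active:
        # Feature flag disabled: report OK without evaluating the result.
        return ExitCode.OK
    if failed_count is None:
        return ExitCode.UNKNOWN
    return ExitCode.CRITICAL if failed_count > 0 else ExitCode.OK
```

A caller such as a scheduler hook or monitoring agent would typically only need to distinguish `CRITICAL` (replace the GPU) from `UNKNOWN` (retry or investigate the query itself).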

Usage Examples

Basic Check

health_checks check-nvidia-smi \
    -c row_remap_failed \
    --sink otel \
    --sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
    [CLUSTER] \
    app
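
When the check is launched from a script rather than a shell, passing the arguments as a list avoids the nested quoting in `--sink-opts`. The sketch below mirrors the argument order of the command above; `build_command`, `run_check`, and the parameter names (including `mode` for the trailing `app` argument) are hypothetical wrappers, not part of the tool.

```python
import subprocess


def build_command(cluster, sink="otel", sink_opts=None, mode="app"):
    """Build the argv list for the row_remap_failed check.

    Hypothetical helper: argument order is copied from the basic
    example on this page; `mode` is the trailing `app` argument.
    """
    cmd = [
        "health_checks", "check-nvidia-smi",
        "-c", "row_remap_failed",
        "--sink", sink,
    ]
    if sink_opts is not None:
        # Passed as a single argv element, so no shell quoting is needed.
        cmd += ["--sink-opts", sink_opts]
    cmd += [cluster, mode]
    return cmd


def run_check(cluster, **kwargs):
    """Run the check and return its exit code (see Exit Conditions)."""
    result = subprocess.run(build_command(cluster, **kwargs), check=False)
    return result.returncode
```

For example, `run_check("my_cluster", sink_opts="log_resource_attributes={'attr_1': 'value1'}")` would invoke the same command as the basic check, with `my_cluster` standing in for `[CLUSTER]`.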