
row_remap_pending

Overview

Validates that no GPU has a pending row remap. A pending remap requires a GPU reset to complete and may affect job stability until then.
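For orientation, the sketch below shows one way to count pending remaps by hand. It is not the check's own implementation. The helper name `count_pending_remaps` is hypothetical, and it assumes CSV rows of the form `<gpu_bus_id>, <pending>` in which a pending remap is reported as `1` or `Yes`.

```shell
# Hypothetical helper: count GPUs that report a pending row remap.
# Reads CSV rows "<gpu_bus_id>, <pending>" on stdin (assumed format).
# A row counts as pending when the second field is "1" or "yes" (case-insensitive).
count_pending_remaps() {
  awk -F', *' '
    { v = tolower($2); gsub(/[ \t\r]/, "", v) }   # normalize the pending field
    v == "1" || v == "yes" { n++ }                # tally pending rows
    END { print n + 0 }                           # print 0 when nothing matched
  '
}
```

On a host with NVIDIA drivers, the input could come from `nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.pending --format=csv,noheader | count_pending_remaps`. A result greater than zero corresponds to the condition this check reports as CRITICAL.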

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
| --- | --- |
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No pending remaps |
| CRITICAL (2) | Pending remaps detected |
| UNKNOWN (3) | Unable to query remap status |
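A caller can branch on these exit codes. The following is a minimal sketch, not part of the tool: the function names `exit_status_name` and `needs_gpu_reset` are hypothetical, and treating any code not listed above as `UNEXPECTED` is an assumption.

```shell
# Hypothetical helper: map the check's exit code to the status name listed above.
exit_status_name() {
  case "$1" in
    0) echo "OK" ;;          # killswitch active, or no pending remaps
    2) echo "CRITICAL" ;;    # pending remaps detected
    3) echo "UNKNOWN" ;;     # remap status could not be queried
    *) echo "UNEXPECTED" ;;  # assumption: anything else is not documented for this check
  esac
}

# Hypothetical helper: succeeds only for CRITICAL, the one case where a
# pending remap is confirmed and a GPU reset is needed to complete it.
needs_gpu_reset() {
  [ "$1" -eq 2 ]
}
```

After running the check, capture `$?` and pass it to either helper, for example `rc=$?; echo "row_remap_pending: $(exit_status_name "$rc")"`. Keeping UNKNOWN separate from CRITICAL avoids resetting a GPU when the query itself failed.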

Usage Examples

Basic Check

health_checks check-nvidia-smi \
  -c row_remap_pending \
  --sink otel \
  --sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
  [CLUSTER] \
  app