row_remap

Overview

Checks for pending or failed memory row remapping operations using nvmlDeviceGetRowRemapperHistogram(). Row remapping is triggered by memory defects: a pending remap takes effect only after a GPU reset, while a failed remap usually calls for GPU replacement.
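
As a rough illustration of how pending and failed remaps can be read from NVML, the sketch below uses the `pynvml` bindings. It is not the check's actual implementation: it calls `nvmlDeviceGetRemappedRows`, which returns the pending and failure flags directly, rather than the histogram call named above. `RemapStatus`, `parse_remapped_rows`, and `query_remap_status` are hypothetical names introduced here.

```python
from typing import NamedTuple, Tuple


class RemapStatus(NamedTuple):
    """Hypothetical container for one GPU's row-remap state."""
    correctable_rows: int
    uncorrectable_rows: int
    pending: bool
    failed: bool


def parse_remapped_rows(raw: Tuple[int, int, int, int]) -> RemapStatus:
    # pynvml returns (corrRows, uncRows, isPending, failureOccurred).
    corr, unc, is_pending, failure_occurred = raw
    return RemapStatus(int(corr), int(unc), bool(is_pending), bool(failure_occurred))


def query_remap_status(gpu_index: int = 0) -> RemapStatus:
    # Imported lazily so the parsing helper works without a GPU or driver.
    import pynvml  # assumption: the nvidia-ml-py package is installed

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return parse_remapped_rows(pynvml.nvmlDeviceGetRemappedRows(handle))
    finally:
        pynvml.nvmlShutdown()
```

On GPUs that do not support row remapping, the NVML call raises an error, which a caller would need to handle separately.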

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No pending or failed remaps |
| CRITICAL (2) | Pending remaps detected |
| CRITICAL (2) | Failed remaps detected |
| UNKNOWN (3) | Unable to query remap status |
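
The exit conditions above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the tool's own code: `ExitCode` and `row_remap_exit_code` are hypothetical names, and a failed query is modeled by passing `None` for the remap flags.

```python
from enum import IntEnum
from typing import Optional


class ExitCode(IntEnum):
    # Only the codes listed in the exit-condition table are modeled here.
    OK = 0
    CRITICAL = 2
    UNKNOWN = 3


def row_remap_exit_code(
    check_enabled: bool,
    pending: Optional[bool],
    failed: Optional[bool],
) -> ExitCode:
    # Killswitch active: the check is skipped and reports OK.
    if not check_enabled:
        return ExitCode.OK
    # None stands for "remap status could not be queried".
    if pending is None or failed is None:
        return ExitCode.UNKNOWN
    # Either a pending or a failed remap is treated as critical.
    if pending or failed:
        return ExitCode.CRITICAL
    return ExitCode.OK
```

For example, `row_remap_exit_code(True, pending=True, failed=False)` evaluates to `ExitCode.CRITICAL`, while a disabled check evaluates to `ExitCode.OK` regardless of the remap flags.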

Usage Examples

Basic Check

health_checks check-nvidia-smi \
  -c row_remap \
  --sink otel \
  --sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
  [CLUSTER] \
  app