xid

Overview

Detects NVIDIA GPU hardware errors by searching dmesg for XID (eXtended ID) error codes. Classifies errors by severity using an internal XID error code database: critical XIDs trigger the CRITICAL state, and non-critical XIDs trigger the WARN state.
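
The detection and classification described above can be sketched roughly as follows. This is an illustration, not the check's actual implementation: the function name `classify_xids` and the contents of `CRITICAL_XIDS` are hypothetical, and the real severity database is internal to the check. The sketch assumes the usual NVIDIA driver kernel log format, `NVRM: Xid (PCI:...): <code>, ...`.

```shell
# Hypothetical critical set for illustration only; the real check uses its
# own internal XID database, which may classify codes differently.
CRITICAL_XIDS="48 79"

# Read kernel log text on stdin and print OK, WARN, or CRITICAL.
classify_xids() {
    # Extract the numeric XID code from lines such as:
    #   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
    codes=$(sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p')

    # No XID lines at all: nothing to report.
    if [ -z "$codes" ]; then
        echo "OK"
        return 0
    fi

    # Any code in the critical set decides the overall state.
    for code in $codes; do
        case " $CRITICAL_XIDS " in
            *" $code "*) echo "CRITICAL"; return 0 ;;
        esac
    done

    # XIDs were found, but none of them is in the critical set.
    echo "WARN"
}
```

On a host with the NVIDIA driver loaded, this could be tried with `dmesg | classify_xids`. The function only prints the state name; it does not reproduce the check's exit codes.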

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
| --- | --- |
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No XID errors found |
| WARN (1) | Non-critical XID errors detected |
| WARN (1) | Command execution failed |
| CRITICAL (2) | Critical XID errors detected |
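
A wrapper script may want to turn the numeric exit code back into a state name. The helper below is a minimal sketch of that mapping; the function name `xid_state_from_exit_code` and the `UNKNOWN` fallback are assumptions made for illustration and are not part of the tool.

```shell
# Map an exit code from the table above to its state name.
# "UNKNOWN" for any other value is an assumption of this sketch.
xid_state_from_exit_code() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARN" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example use right after running the check (not executed here):
#   health_checks check-syslogs xid [CLUSTER] app
#   echo "xid check state: $(xid_state_from_exit_code $?)"
```

Several conditions share an exit code: WARN (1) covers both non-critical XID errors and a failed command, and OK (0) covers both a clean result and a disabled feature flag. The exit code alone therefore cannot tell these cases apart.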

Usage Examples

xid - Basic Check

health_checks check-syslogs xid [CLUSTER] app

xid - Custom Timeout

health_checks check-syslogs xid \
--timeout 60 \
[CLUSTER] \
app

xid - Debug Mode

health_checks check-syslogs xid \
--log-level DEBUG \
--verbose-out \
[CLUSTER] \
app