
pcie-aer

Overview

Detects PCIe Advanced Error Reporting (AER) errors by searching dmesg for AER.*error patterns. PCIe AER errors can indicate GPU communication issues on the PCIe bus.
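The search step can be illustrated with a minimal Python sketch. It assumes the pattern is applied line by line and case-sensitively; the function name `find_aer_lines` and the sample log text are hypothetical and are not taken from the check's implementation.

```python
import re
from typing import List

# Assumption: the pattern from the overview, applied case-sensitively per line.
AER_PATTERN = re.compile(r"AER.*error")


def find_aer_lines(dmesg_output: str) -> List[str]:
    """Return the lines of dmesg output that match the AER error pattern."""
    return [line for line in dmesg_output.splitlines() if AER_PATTERN.search(line)]


# Hypothetical dmesg excerpt used only for illustration.
sample = (
    "pci 0000:00:01.0: enabling device\n"
    "pcieport 0000:00:01.0: AER: Corrected error received: 0000:01:00.0\n"
    "nvme nvme0: 8/0/0 default/read/poll queues\n"
)

for line in find_aer_lines(sample):
    print(line)
```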

Lines are classified by severity using pattern matching against known Linux kernel PCIe AER log formats (source of truth: pcie_severity.py):

| Severity | Patterns | Examples |
|---|---|---|
| Critical | Uncorrectable (Fatal), can't recover | pcieport 0000:00:03.0: AER: Uncorrectable (Fatal) error received |
| Warning | Uncorrectable (non-fatal) | pcieport 0000:00:02.0: AER: Uncorrectable (Non-Fatal) error received |
| Informational | Corrected error | pcieport 0000:00:01.0: AER: Corrected error received: 0000:01:00.0 |
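As a rough illustration of this classification, the sketch below matches a log line against one regular expression per severity, checking the most severe first. The regexes, the `Severity` enum, and the `classify` function are assumptions derived from the patterns listed above; the actual rules live in pcie_severity.py and may differ, including in how unmatched lines are handled.

```python
import re
from enum import Enum
from typing import Optional


class Severity(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    CRITICAL = "critical"


# Hypothetical rules, ordered from most to least severe so the worst match wins.
_RULES = [
    (Severity.CRITICAL, re.compile(r"Uncorrectable \(Fatal\)|can't recover", re.IGNORECASE)),
    (Severity.WARNING, re.compile(r"Uncorrectable \(Non-Fatal\)", re.IGNORECASE)),
    (Severity.INFORMATIONAL, re.compile(r"Corrected error", re.IGNORECASE)),
]


def classify(line: str) -> Optional[Severity]:
    """Return the severity of an AER log line, or None if no rule matches."""
    for severity, pattern in _RULES:
        if pattern.search(line):
            return severity
    return None  # Assumption: unmatched lines are left unclassified.


print(classify("pcieport 0000:00:03.0: AER: Uncorrectable (Fatal) error received"))
```

Ordering the rules by severity keeps a line that happens to match more than one pattern from being downgraded.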

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No PCIe AER errors detected |
| OK (0) | Only corrected PCIe AER errors (hardware auto-recovered) |
| WARN (1) | Command execution failed |
| WARN (1) | Uncorrectable non-fatal PCIe AER errors detected |
| CRITICAL (2) | Fatal PCIe AER errors or unrecoverable device state |
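One way to read these conditions is that the exit code follows the worst severity found. The Python sketch below shows that mapping. The names `ExitCode`, `Severity`, and `exit_code_for` are hypothetical, and the order in which the killswitch and the command-failure case are checked is an assumption rather than documented behavior.

```python
from enum import IntEnum
from typing import Iterable


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


class Severity(IntEnum):
    INFORMATIONAL = 0
    WARNING = 1
    CRITICAL = 2


def exit_code_for(
    severities: Iterable[Severity],
    command_failed: bool = False,
    killswitch_active: bool = False,
) -> ExitCode:
    """Map the severities found in dmesg to an exit code."""
    if killswitch_active:
        return ExitCode.OK  # Feature flag disabled: the check is skipped.
    if command_failed:
        return ExitCode.WARN  # Assumption: a failed command takes precedence over findings.
    worst = max(severities, default=None)  # None when no AER errors were found.
    if worst == Severity.CRITICAL:
        return ExitCode.CRITICAL
    if worst == Severity.WARNING:
        return ExitCode.WARN
    return ExitCode.OK  # No errors, or only corrected errors.


print(exit_code_for([Severity.INFORMATIONAL, Severity.WARNING]).name)
```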

Usage Examples

pcie-aer - Basic Check

health_checks check-syslogs pcie-aer [CLUSTER] app

pcie-aer - Custom Timeout

health_checks check-syslogs pcie-aer \
--timeout 60 \
[CLUSTER] \
app

pcie-aer - Debug Mode

health_checks check-syslogs pcie-aer \
--log-level DEBUG \
--verbose-out \
[CLUSTER] \
app