mce

Overview

Detects Machine Check Exception (MCE) errors by searching dmesg for MCE-related patterns. MCE errors indicate CPU or memory hardware issues that may affect system stability.

Lines are classified by severity using pattern matching against known Linux kernel MCE log formats (source of truth: mce_severity.py):

| Severity | Patterns | Examples |
| --- | --- | --- |
| Critical | [Hardware Error], Machine Check Exception, Uncorrected error, Fatal error, Processor context corrupt | mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 9 |
| Warning | Corrected error, temperature above threshold, cpu clock throttled, CMCI storm | mce: CPU0: 1 Corrected error(s) detected. Check CMCI storm count. |
| Informational | temperature.*normal, CPU is offline, Disabling lock | mce: CPU0: Core temperature/speed normal |

mce: lines that match none of the patterns above default to Warning for safety.
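
The classification rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the contents of mce_severity.py: the names `SEVERITY_PATTERNS` and `classify_line` are hypothetical, and the evaluation order (Critical first), the case-insensitive matching, and the `None` result for lines unrelated to MCE are assumptions. Checking Critical first matters under case-insensitive matching, because "Uncorrected error" also contains the text "corrected error".

```python
import re
from typing import Optional

# Patterns copied from the severity table. The ordering (most severe first)
# and case-insensitive matching are assumptions, not confirmed behavior.
SEVERITY_PATTERNS = [
    ("critical", [
        r"\[Hardware Error\]",
        r"Machine Check Exception",
        r"Uncorrected error",
        r"Fatal error",
        r"Processor context corrupt",
    ]),
    ("warning", [
        r"Corrected error",
        r"temperature above threshold",
        r"cpu clock throttled",
        r"CMCI storm",
    ]),
    ("informational", [
        r"temperature.*normal",
        r"CPU is offline",
        r"Disabling lock",
    ]),
]


def classify_line(line: str) -> Optional[str]:
    """Return the severity of one dmesg line, or None if it is not MCE-related."""
    for severity, patterns in SEVERITY_PATTERNS:
        if any(re.search(p, line, re.IGNORECASE) for p in patterns):
            return severity
    # Unrecognized "mce:" lines default to warning, as described above.
    if "mce:" in line:
        return "warning"
    # Hypothetical choice: lines with no MCE marker are ignored.
    return None
```

For example, `classify_line("mce: CPU0: Core temperature/speed normal")` falls through the Critical and Warning lists and matches `temperature.*normal` in the Informational list.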

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
| --- | --- |
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No MCE errors detected |
| OK (0) | Only informational MCE events (e.g., temperature back to normal) |
| WARN (1) | Command execution failed |
| WARN (1) | Corrected errors or thermal throttling detected |
| CRITICAL (2) | Hardware errors or uncorrected MCE events detected |
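
One way to read the exit conditions is as a "worst severity wins" reduction over the classified lines. The sketch below illustrates that reading and is not the check's actual implementation: `ExitCode`, `resolve_exit_code`, and its parameters are hypothetical names, and the precedence of the killswitch over a command failure is an assumption.

```python
from enum import IntEnum
from typing import Iterable


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def resolve_exit_code(
    severities: Iterable[str],
    command_failed: bool = False,
    killswitch_active: bool = False,
) -> ExitCode:
    """Map per-line severities to an exit code following the table above."""
    # Assumption: with the feature flag disabled, nothing else is evaluated.
    if killswitch_active:
        return ExitCode.OK
    # A failed command yields WARN, since no output could be inspected.
    if command_failed:
        return ExitCode.WARN
    found = set(severities)
    if "critical" in found:
        return ExitCode.CRITICAL
    if "warning" in found:
        return ExitCode.WARN
    # No matches, or only informational events.
    return ExitCode.OK
```

Under these assumptions, `resolve_exit_code(["informational", "warning"])` yields `ExitCode.WARN`, and an empty list yields `ExitCode.OK`.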

Usage Examples

mce - Basic Check

health_checks check-syslogs mce [CLUSTER] app

mce - Custom Timeout

health_checks check-syslogs mce \
    --timeout 60 \
    [CLUSTER] \
    app

mce - Debug Mode

health_checks check-syslogs mce \
    --log-level DEBUG \
    --verbose-out \
    [CLUSTER] \
    app