
check-ipmitool

Overview

Validates hardware health by reading and analyzing the System Event Log (SEL) with ipmitool or nvipmitool. Detects and reports critical hardware errors, including power supply failures, ECC errors, PCIe errors, processor throttling, and BIOS corruption events. Automatically clears the SEL when its line count exceeds a configurable threshold (--clear_log_threshold, default 40).
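The scan described above can be pictured as a pattern match over SEL lines. The following is a minimal sketch, not the check's actual implementation: the function name scan_sel and the patterns in CRITICAL_PATTERNS are hypothetical placeholders chosen to mirror the error classes named in this overview. The authoritative list lives in check_ipmitool.py.

```shell
#!/bin/sh
# Hypothetical patterns for illustration only; the real check defines its
# own list of critical events in check_ipmitool.py.
CRITICAL_PATTERNS='Power Supply.*Failure|Uncorrectable ECC|PCI.*Error|Throttl|BIOS.*Corrupt'

# Reads SEL text on stdin. Prints every line that matches a critical
# pattern and returns 2 if at least one matched, otherwise returns 0.
scan_sel() {
    # grep exits 1 when nothing matches; "|| true" keeps that from being
    # treated as a failure of the function itself.
    matches=$(grep -E -i "$CRITICAL_PATTERNS" || true)
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches"
        return 2
    fi
    return 0
}
```

Because the function reads from stdin, it can be tried against a saved SEL dump without touching the BMC, for example `scan_sel < sel_dump.txt`, where sel_dump.txt is a file you captured yourself.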

Dependencies

System Requirements

  • ipmitool binary installed and on PATH
  • OR nvipmitool binary, for the NVIDIA-specific implementation
  • IPMI kernel modules loaded (ipmi_devintf, ipmi_si)
  • BMC (Baseboard Management Controller) accessible
  • sudo privileges (used by default; disable with --no-sudo)

Required Commands

# For standard IPMI
which ipmitool

# For NVIDIA implementation
which nvipmitool

# Verify IPMI devices
ls -l /dev/ipmi*
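The manual checks above can be folded into a small preflight script. This is a sketch under assumptions: the helper names (have_cmd, pick_ipmi_binary, have_ipmi_device) are hypothetical, and the device paths are the ones the Linux IPMI open interface commonly exposes, which may differ on your system.

```shell
#!/bin/sh
# Returns 0 if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Prints the first IPMI binary found (ipmitool preferred), or returns 1
# if neither ipmitool nor nvipmitool is installed.
pick_ipmi_binary() {
    for bin in ipmitool nvipmitool; do
        if have_cmd "$bin"; then
            printf '%s\n' "$bin"
            return 0
        fi
    done
    return 1
}

# Returns 0 if at least one IPMI device node exists.
# Assumption: these are the usual device paths; adjust for your platform.
have_ipmi_device() {
    for dev in /dev/ipmi0 /dev/ipmi/0 /dev/ipmidev/0; do
        [ -e "$dev" ] && return 0
    done
    return 1
}
```

A typical use is to call `pick_ipmi_binary` and `have_ipmi_device` before scheduling the health check, and to skip or flag the node if either fails.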

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --ipmitool/--nvipmitool | Flag | --ipmitool | Use ipmitool or nvipmitool binary |
| --sudo/--no-sudo | Flag | --sudo | Execute command with sudo privileges |
| --clear_log_threshold | Integer | 40 | Clear SEL when line count exceeds threshold |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |
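To make the --clear_log_threshold behavior concrete, here is a minimal sketch of the comparison it implies. The function name sel_exceeds_threshold is hypothetical, and the sketch assumes "exceeds" means strictly greater than the threshold; the check itself may count lines differently.

```shell
#!/bin/sh
# Returns 0 (true) when the SEL text on stdin has more lines than the
# threshold given as $1 (defaults to 40, matching the documented default).
sel_exceeds_threshold() {
    threshold=${1:-40}
    # tr strips the padding some wc implementations add around the count.
    count=$(wc -l | tr -d ' ')
    [ "$count" -gt "$threshold" ]
}
```

Assuming `ipmitool sel list` is the listing command, a manual equivalent of the automatic clear would look like `sudo ipmitool sel list | sel_exceeds_threshold 40 && sudo ipmitool sel clear`. Wrapping each call in `timeout 300` (GNU coreutils) would approximate the --timeout option.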

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No errors detected |
| WARN (1) | Command failed |
| WARN (1) | Clear failed |
| CRITICAL (2) | Hardware error detected |
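A wrapper script can translate these codes into labels for alerting. The sketch below assumes the process exit status of the health check matches the numeric codes in the table; the function name describe_exit and the UNKNOWN fallback are hypothetical additions for illustration.

```shell
#!/bin/sh
# Maps a numeric exit status to the label used in the table above.
# Any status outside 0-2 is reported as UNKNOWN (an assumption of this sketch).
describe_exit() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARN" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}
```

After running the check, `describe_exit "$?"` prints the matching label, which can then be routed to whatever alerting the cluster uses.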

Critical hardware events: https://github.com/facebookresearch/gcm/blob/main/gcm/health_checks/checks/check_ipmitool.py#L102-L110

Usage Examples

Basic SEL Check

health_checks check-ipmitool check-sel \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

Using nvipmitool Without Sudo

health_checks check-ipmitool check-sel \
--nvipmitool \
--no-sudo \
--sink stdout \
[CLUSTER] \
app

Custom Clear Threshold with Debug Logging

health_checks check-ipmitool check-sel \
--clear_log_threshold 100 \
--log-level DEBUG \
--verbose-out \
--sink stdout \
[CLUSTER] \
app