check-ib-counters

Overview

Monitors InfiniBand port error and throughput counters via sysfs, detecting runtime fabric degradation that silently hurts distributed training throughput (NCCL AllReduce, FSDP, etc.). This is the runtime complement to check-iblink: where check-iblink validates link presence, check-ib-counters detects performance-degrading conditions on active links.

Requirements

  • InfiniBand Drivers: Mellanox/NVIDIA OFED or inbox kernel drivers
  • Sysfs: Kernel sysfs mounted at /sys with counters at /sys/class/infiniband/{device}/ports/{port}/counters/
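To illustrate the sysfs layout above, here is a minimal sketch that walks `/sys/class/infiniband/{device}/ports/{port}/counters/` and reads every counter file as an integer. It is not the check's actual implementation. The function name `read_port_counters` and the `sysfs_root` parameter are hypothetical, introduced only for this example.

```python
from pathlib import Path


def read_port_counters(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): {counter_name: value}} for every port found.

    Hypothetical helper for illustration; sysfs_root is a parameter so the
    function can be pointed at a fake directory tree in tests.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # no InfiniBand stack (or sysfs not mounted)

    # Layout: {root}/{device}/ports/{port}/counters/{counter_name}
    for counters_dir in sorted(root.glob("*/ports/*/counters")):
        device = counters_dir.parts[-4]
        port = counters_dir.parts[-2]
        values = {}
        for entry in counters_dir.iterdir():
            try:
                values[entry.name] = int(entry.read_text().strip())
            except (OSError, ValueError):
                # Skip unreadable or non-numeric entries instead of failing.
                continue
        result[(device, port)] = values
    return result
```

On a host without InfiniBand hardware the function returns an empty dictionary, which corresponds to the "No IB ports discovered" case listed under Exit Conditions.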

Monitored Counters

Error Counters

| Counter | Description |
| --- | --- |
| symbol_error | Physical layer symbol errors |
| link_error_recovery | Link error recovery events |
| link_downed | Link down transitions |
| port_rcv_errors | Malformed packets received |
| port_rcv_remote_physical_errors | Remote physical receive errors |
| port_rcv_switch_relay_errors | Switch relay errors on receive path |
| port_xmit_discards | Outbound packets discarded |
| port_xmit_constraint_errors | Transmit constraint violations |
| port_rcv_constraint_errors | Receive constraint violations |
| local_link_integrity_errors | Local link integrity failures |
| excessive_buffer_overrun_errors | Buffer overrun events |
| VL15_dropped | Dropped VL15 (management) packets |
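The thresholds described under Command-Line Options compare against a total error count. One plausible way to compute that total for a single port is sketched below. The helper `total_errors` is hypothetical, and treating the total as a plain sum over the twelve error counters is an assumption, as is counting a missing counter as zero.

```python
# Error counter names as listed in the table above.
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_rcv_remote_physical_errors",
    "port_rcv_switch_relay_errors",
    "port_xmit_discards",
    "port_xmit_constraint_errors",
    "port_rcv_constraint_errors",
    "local_link_integrity_errors",
    "excessive_buffer_overrun_errors",
    "VL15_dropped",
)


def total_errors(counters):
    """Sum the error counters in a {name: value} mapping (hypothetical helper).

    Assumption: counters absent from the mapping contribute 0, and
    throughput counters such as port_xmit_data are ignored.
    """
    return sum(counters.get(name, 0) for name in ERROR_COUNTERS)
```

For example, `total_errors({"symbol_error": 3, "link_downed": 2, "port_xmit_data": 999})` yields 5, because only the two error counters are included.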

Throughput Counters (informational)

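| Counter | Description |
| --- | --- |
| port_xmit_data | Total data transmitted, in units of 4 bytes (32-bit words) |
| port_rcv_data | Total data received, in units of 4 bytes (32-bit words) |
| port_xmit_packets | Total packets transmitted |
| port_rcv_packets | Total packets received |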
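Because these counters are cumulative, a rate has to be derived from two samples. The sketch below shows one way to do that, assuming `port_xmit_data` and `port_rcv_data` count 4-byte units as the kernel's sysfs ABI documentation describes. The function name and the choice to return `None` on a backwards-moving counter are illustrative, not taken from the check itself.

```python
def throughput_bytes_per_sec(data_before, data_after, interval_sec):
    """Estimate bytes/second from two samples of port_xmit_data or port_rcv_data.

    Hypothetical helper. Assumes each counter unit represents 4 bytes.
    Returns None if the counter went backwards (reset or wrap), since no
    meaningful rate can be derived from that pair of samples.
    """
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    delta_units = data_after - data_before
    if delta_units < 0:
        return None
    return delta_units * 4 / interval_sec
```

With samples of 1000 and 3500 taken 10 seconds apart, the delta is 2500 units, or 10000 bytes, giving 1000.0 bytes per second.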

Command-Line Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --warn-threshold | Integer | 0 | Total error count above which the check returns WARNING |
| --crit-threshold | Integer | 100 | Total error count above which the check returns CRITICAL |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | One of DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
| --- | --- |
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All error counters within thresholds |
| WARN (1) | No IB ports discovered |
| WARN (1) | Total errors exceed --warn-threshold |
| CRITICAL (2) | Total errors exceed --crit-threshold |
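The conditions above can be read as a small decision procedure. The following sketch encodes them with the documented defaults (`--warn-threshold` 0, `--crit-threshold` 100). The names `ExitCode` and `evaluate` are hypothetical, and the evaluation order (killswitch first, then port discovery, then thresholds) is an assumption, since the table does not state a precedence.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def evaluate(total_errors, ports_found, warn_threshold=0, crit_threshold=100,
             enabled=True):
    """Map a total error count to an exit code (illustrative sketch only)."""
    if not enabled:
        return ExitCode.OK        # feature flag disabled (killswitch active)
    if ports_found == 0:
        return ExitCode.WARN      # no IB ports discovered
    if total_errors > crit_threshold:
        return ExitCode.CRITICAL  # strictly above the critical threshold
    if total_errors > warn_threshold:
        return ExitCode.WARN      # strictly above the warning threshold
    return ExitCode.OK            # all error counters within thresholds
```

Under these assumptions and the default thresholds, a single error on any port yields WARN, exactly 100 errors still yields WARN, and 101 or more yields CRITICAL.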

Usage Examples

Basic Check

health_checks check-ib check-ib-counters [CLUSTER] app

With Custom Thresholds

health_checks check-ib check-ib-counters \
    --warn-threshold 10 \
    --crit-threshold 500 \
    --sink otel \
    --sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
    [CLUSTER] \
    app

Debug Mode

health_checks check-ib check-ib-counters \
    --log-level DEBUG \
    --verbose-out \
    --sink stdout \
    [CLUSTER] \
    app