check-iblink

Overview

Validates InfiniBand link health by comparing the runtime link state read from sysfs against the node configuration manifest. It detects hardware issues, link degradation, firmware mismatches, and binding problems.

The check expects a manifest file that describes the node's hardware, such as DGX_A100.json.

Requirements

  • InfiniBand Drivers: Mellanox/NVIDIA OFED or inbox kernel drivers
  • Sysfs: Kernel sysfs mounted at /sys
  • Manifest File: Valid JSON configuration at specified path
  • HCA Hardware: Mellanox/NVIDIA ConnectX adapters

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --manifest_file | String | /etc/manifest.json | Path to node configuration manifest |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Manifest Format

Expected Structure

{
  "pci": {
    "0000:17:00.0": {
      "slot": "slot0",
      "dev": "ib0",
      "type": "ib"
    }
  },
  "ib": {
    "link_rate": "200 Gb/sec (4X HDR)",
    "firmware_version": ["20.28.1002", "20.28.1010"],
    "interfaces": {
      "ib0": {
        "mlx": "mlx5_0",
        "desc": "Mellanox ConnectX-6"
      }
    }
  }
}
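To make the relationship between the "pci" and "ib" sections concrete, here is a minimal Python sketch that flattens a manifest of this shape into one record per InfiniBand device. The `ExpectedLink` class and `expected_links` function are hypothetical names chosen for illustration; they are not part of check-iblink, and the real tool may parse the manifest differently.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedLink:
    """Hypothetical record: what the manifest expects for one IB device."""
    pci_id: str
    slot: str
    dev: str
    mlx: str
    desc: str
    link_rate: str
    firmware_versions: tuple


def expected_links(manifest: dict) -> list:
    """Join each "pci" entry of type "ib" with its "ib.interfaces" entry."""
    ib = manifest["ib"]
    links = []
    for pci_id, entry in manifest["pci"].items():
        if entry.get("type") != "ib":
            continue  # skip PCI devices that are not InfiniBand
        iface = ib["interfaces"][entry["dev"]]
        links.append(ExpectedLink(
            pci_id=pci_id,
            slot=entry["slot"],
            dev=entry["dev"],
            mlx=iface["mlx"],
            desc=iface["desc"],
            link_rate=ib["link_rate"],
            firmware_versions=tuple(ib["firmware_version"]),
        ))
    return links


# The sample manifest from this section.
SAMPLE = json.loads("""
{
  "pci": {"0000:17:00.0": {"slot": "slot0", "dev": "ib0", "type": "ib"}},
  "ib": {
    "link_rate": "200 Gb/sec (4X HDR)",
    "firmware_version": ["20.28.1002", "20.28.1010"],
    "interfaces": {"ib0": {"mlx": "mlx5_0", "desc": "Mellanox ConnectX-6"}}
  }
}
""")

for link in expected_links(SAMPLE):
    print(link.pci_id, link.dev, link.mlx, link.link_rate)
```

Note that in this layout the link rate and the accepted firmware versions are shared by every interface, while the slot, netdev, and mlx5 device names are per device.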

Validation Logic

Data Collection

For each InfiniBand PCI device:

  1. Read from sysfs:

    • /sys/bus/pci/devices/{pci_id}/infiniband/{mlx}/node_desc - Device description
    • /sys/bus/pci/devices/{pci_id}/infiniband/{mlx}/fw_ver - Firmware version
    • /sys/bus/pci/devices/{pci_id}/infiniband/{mlx}/ports/1/phys_state - Physical state
    • /sys/bus/pci/devices/{pci_id}/infiniband/{mlx}/ports/1/state - Link state
    • /sys/bus/pci/devices/{pci_id}/infiniband/{mlx}/ports/1/link_layer - Protocol layer
    • /sys/bus/pci/devices/{pci_id}/infiniband/{mlx}/ports/1/rate - Link rate
    • /sys/bus/pci/devices/{pci_id}/net/{ibdev}/operstate - Network operational state
  2. Compare against manifest: Validate each attribute against expected values
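The collection step can be sketched in Python as follows. This is only an illustration of reading the sysfs paths listed above; `collect_link_state`, `read_attr`, and the `sysfs_root` parameter are hypothetical names (the last one exists so the sketch can be pointed at a fake tree for testing), and the real tool's error handling may differ.

```python
from pathlib import Path

# Per-port attributes, read from ports/1 as listed above.
PORT_ATTRS = ("phys_state", "state", "link_layer", "rate")


def read_attr(path: Path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def collect_link_state(pci_id: str, mlx: str, ibdev: str, sysfs_root: str = "/sys") -> dict:
    """Read the runtime attributes for one InfiniBand PCI device."""
    pci_dir = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_id
    ib_dir = pci_dir / "infiniband" / mlx
    state = {
        "node_desc": read_attr(ib_dir / "node_desc"),
        "fw_ver": read_attr(ib_dir / "fw_ver"),
        "operstate": read_attr(pci_dir / "net" / ibdev / "operstate"),
    }
    for attr in PORT_ATTRS:
        state[attr] = read_attr(ib_dir / "ports" / "1" / attr)
    return state


if __name__ == "__main__":
    # Values are None on hosts without this device.
    print(collect_link_state("0000:17:00.0", "mlx5_0", "ib0"))
```

Returning None for a missing attribute, rather than raising, lets a later comparison step report a missing device as an ordinary mismatch; that is a design choice of this sketch, not documented behavior.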

Issue Classification

Critical Issues (Exit Code 2)

| Issue | Description | Example |
|---|---|---|
| MISBIND | PCI slot bound to incorrect IB interface | slot0 is bound to ib1, expected ib0 |
| MLX5_MISMATCH | IB netdev and mlx5 device don't match | ib0 is bound to mlx5_1, expected mlx5_0 |
| MLX5_PROTOCOL_MISMATCH | Port not in InfiniBand mode | ib0 is not presenting as an InfiniBand link |
| LINK_RATE_MISMATCH | Link speed degraded | ib0 has degraded link 100 Gb/sec |
| LINK_NOT_UP | Physical link down | ib0 is not up, link state is 1: DOWN |
| LINK_BAD_STATE | Link up but not ACTIVE | ib0 is not up, link state is 3: ARMED |
| LINK_OPERSTATE_DOWN | Netdev down despite IB ACTIVE | ib0 is ACTIVE, but the logical interface is down |

Warning Issues (Exit Code 1)

| Issue | Description | Example |
|---|---|---|
| LINK_IN_INIT | Link in initialization state | ib0 is not up, link state is 2: INIT |
| FIRMWARE_MISMATCH | Firmware version unexpected | ib0 has fw version 20.27.1000, expected [20.28.1002] |
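As a rough illustration of how the link-level rows in the two issue tables could be derived from the collected values, the Python sketch below classifies one link. The `classify_link` function, the order of the checks, and the decision to compare the rate only when the link is ACTIVE are assumptions made for this example; the actual check may evaluate conditions differently. The binding checks (MISBIND, MLX5_MISMATCH) are omitted.

```python
CRITICAL, WARNING = 2, 1


def classify_link(dev, observed, expected_rate, expected_fw):
    """Return a list of (issue, severity, message) tuples for one link.

    `observed` holds stripped sysfs strings, e.g. state "4: ACTIVE".
    """
    issues = []
    if observed["link_layer"] != "InfiniBand":
        issues.append(("MLX5_PROTOCOL_MISMATCH", CRITICAL,
                       f"{dev} is not presenting as an InfiniBand link"))

    state = observed["state"]
    state_name = state.split(":", 1)[-1].strip()  # "4: ACTIVE" -> "ACTIVE"
    if state_name == "DOWN":
        issues.append(("LINK_NOT_UP", CRITICAL,
                       f"{dev} is not up, link state is {state}"))
    elif state_name == "INIT":
        issues.append(("LINK_IN_INIT", WARNING,
                       f"{dev} is not up, link state is {state}"))
    elif state_name != "ACTIVE":
        issues.append(("LINK_BAD_STATE", CRITICAL,
                       f"{dev} is not up, link state is {state}"))
    else:
        # Assumption: rate and operstate are only meaningful on an ACTIVE link.
        if observed["operstate"] != "up":
            issues.append(("LINK_OPERSTATE_DOWN", CRITICAL,
                           f"{dev} is ACTIVE, but the logical interface is down"))
        if observed["rate"] != expected_rate:
            issues.append(("LINK_RATE_MISMATCH", CRITICAL,
                           f"{dev} has degraded link {observed['rate']}"))

    if observed["fw_ver"] not in expected_fw:
        issues.append(("FIRMWARE_MISMATCH", WARNING,
                       f"{dev} has fw version {observed['fw_ver']}, "
                       f"expected {list(expected_fw)}"))
    return issues


HEALTHY = {
    "link_layer": "InfiniBand",
    "state": "4: ACTIVE",
    "operstate": "up",
    "rate": "200 Gb/sec (4X HDR)",
    "fw_ver": "20.28.1002",
}
EXPECTED_RATE = "200 Gb/sec (4X HDR)"
EXPECTED_FW = ["20.28.1002", "20.28.1010"]

armed = dict(HEALTHY, state="3: ARMED")
for issue, severity, message in classify_link("ib0", armed, EXPECTED_RATE, EXPECTED_FW):
    print(severity, issue, message)
```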

Aggregation

  • Collects issues across all InfiniBand links
  • Exit code determined by most severe issue found
  • Output includes summary: up: {count}, down: {count}
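The aggregation rules above amount to taking the maximum severity and counting links. A minimal Python sketch, assuming each link has already been reduced to a hypothetical `LinkResult` carrying an up/down flag and a list of numeric severities (1 for warning, 2 for critical):

```python
from dataclasses import dataclass, field


@dataclass
class LinkResult:
    """Hypothetical per-link result used only in this sketch."""
    dev: str
    is_up: bool
    severities: list = field(default_factory=list)


def aggregate(results):
    """Return (exit_code, summary) across all InfiniBand links."""
    # The most severe issue wins; no issues at all means exit code 0.
    exit_code = max((s for r in results for s in r.severities), default=0)
    up = sum(1 for r in results if r.is_up)
    down = len(results) - up
    return exit_code, f"up: {up}, down: {down}"


results = [
    LinkResult("ib0", True),
    LinkResult("ib1", True, [1]),   # e.g. a firmware warning on a link that is up
    LinkResult("ib2", False, [2]),  # e.g. a link that is down
]
print(aggregate(results))
```

How the real check decides whether a link counts as up or down is not stated here, so the `is_up` flag is left to the caller.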

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All links healthy |
| WARN (1) | Warning-level issues only |
| WARN (1) | Manifest file read failure |
| CRITICAL (2) | Any critical issue present |
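The exit conditions can be read as a short decision sequence: the killswitch short-circuits to OK, an unreadable manifest yields WARN, and otherwise the most severe issue decides. The Python sketch below mirrors that order. `run_check`, its parameters, and the `check_links` callback are hypothetical, and treating invalid JSON as a read failure is an assumption of this example.

```python
import json
from enum import IntEnum


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def run_check(manifest_path, killswitch_active, check_links):
    """Map the documented exit conditions to an ExitCode.

    `check_links` is a caller-supplied function that takes the parsed
    manifest and returns a list of issue severities (1 or 2).
    """
    if killswitch_active:
        return ExitCode.OK  # feature flag disabled: nothing is checked
    try:
        with open(manifest_path) as fh:
            manifest = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return ExitCode.WARN  # manifest file read failure
    severities = check_links(manifest)
    return ExitCode(max(severities, default=ExitCode.OK))


if __name__ == "__main__":
    # With no manifest at this path, the sketch reports WARN.
    print(run_check("/nonexistent/manifest.json", False, lambda m: []).name)
```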

Usage Examples

Basic Validation

health_checks check-ib check-iblink \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

Custom Manifest Location

health_checks check-ib check-iblink \
--manifest_file /opt/config/node_manifest.json \
--sink file --sink-opts filepath=/var/log/iblink_check.json \
[CLUSTER] \
app

Debug Mode

health_checks check-ib check-iblink \
--log-level DEBUG \
--verbose-out \
--sink stdout \
[CLUSTER] \
app