check-hca

Overview

Validates the presence and count of Host Channel Adapters (HCAs) on InfiniBand-enabled compute nodes by querying ibv_devinfo and comparing the number of adapters found against the value passed in --expected-count.
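As a rough illustration of the counting step, the sketch below parses ibv_devinfo-style text and lists the adapters it finds. It assumes each adapter is introduced by a line beginning with `hca_id:`, which is how ibv_devinfo normally prints devices. The helper name `list_hcas` and the sample text are hypothetical and are not taken from the check-hca source.

```python
import re
from typing import List

# Hypothetical helper, not part of check-hca: it only illustrates the idea
# of counting adapters from ibv_devinfo-style output.
HCA_LINE = re.compile(r"^\s*hca_id:\s*(\S+)", re.MULTILINE)


def list_hcas(devinfo_output: str) -> List[str]:
    """Return the device names that appear on `hca_id:` lines."""
    return HCA_LINE.findall(devinfo_output)


# Made-up sample resembling ibv_devinfo output for a two-adapter node.
SAMPLE = """\
hca_id: mlx5_0
        transport:      InfiniBand (0)
        port:   1
                state:  PORT_ACTIVE (4)
hca_id: mlx5_1
        transport:      InfiniBand (0)
        port:   1
                state:  PORT_ACTIVE (4)
"""

if __name__ == "__main__":
    names = list_hcas(SAMPLE)
    print(f"found {len(names)} HCA(s): {names}")
```

The length of the returned list is the value that would be compared with the expected count.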

Requirements

  • InfiniBand Drivers: Mellanox/NVIDIA OFED or inbox drivers
  • ibv_devinfo: Provided by the libibverbs-utils package on RHEL/CentOS and by ibverbs-utils on Ubuntu/Debian
  • HCA Hardware: Mellanox/NVIDIA ConnectX InfiniBand adapters

Package Installation

# RHEL/CentOS
yum install libibverbs libibverbs-utils

# Ubuntu/Debian
apt-get install libibverbs1 ibverbs-utils

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --expected-count | Integer | Required | Expected number of HCAs per node |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |
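The --timeout option bounds how long the underlying command may run. A minimal sketch of that idea using only the Python standard library follows. How check-hca applies the timeout internally is an assumption here, and `run_with_timeout` is a hypothetical name.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(cmd: List[str], timeout_s: int = 300) -> Optional[Tuple[int, str]]:
    """Run cmd and return (return code, stdout), or None if it times out.

    The 300-second default mirrors the documented --timeout default.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return None
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    # A harmless stand-in command; on a real node this might be ["ibv_devinfo"].
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30))
```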

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | HCA count matches expected |
| WARN (1) | HCA count exceeds expected |
| WARN (1) | Command execution failed |
| WARN (1) | Exception during execution |
| CRITICAL (2) | HCA count below expected |
| CRITICAL (2) | No output detected |
| CRITICAL (2) | No HCA found in output |
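The exit conditions above can be read as a small decision procedure. The sketch below encodes them as a pure function. The function name, its parameters, and the `hca_id:` counting rule are assumptions made for illustration, and the order in which the real check evaluates these conditions is not stated in this document.

```python
from typing import Optional

OK, WARN, CRITICAL = 0, 1, 2


def classify(
    output: Optional[str],
    expected: int,
    command_failed: bool = False,
    killswitch_active: bool = False,
) -> int:
    """Map a check result onto the documented exit codes (illustrative only)."""
    if killswitch_active:
        return OK  # feature flag disabled
    if command_failed:
        return WARN  # command execution failed
    if output is None or not output.strip():
        return CRITICAL  # no output detected
    # Assumption: one `hca_id:` line per adapter in the command output.
    found = sum(
        1 for line in output.splitlines() if line.strip().startswith("hca_id:")
    )
    if found == 0:
        return CRITICAL  # no HCA found in output
    if found < expected:
        return CRITICAL  # HCA count below expected
    if found > expected:
        return WARN  # HCA count exceeds expected
    return OK  # HCA count matches expected


if __name__ == "__main__":
    two_hcas = "hca_id: mlx5_0\nhca_id: mlx5_1\n"
    for expected in (1, 2, 4):
        print(expected, classify(two_hcas, expected))
```

Note the asymmetry in the table: missing adapters are treated as CRITICAL, while extra adapters only raise a WARN.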

Usage Examples

Basic Validation

health_checks check-hca \
--expected-count 4 \
--sink otel --sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

Debug Mode

health_checks check-hca \
--expected-count 8 \
--log-level DEBUG \
--verbose-out \
--sink stdout \
[CLUSTER] \
app
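When the check is launched from a script rather than typed by hand, building the argument list programmatically avoids shell-quoting mistakes, particularly around --sink-opts. The sketch below assembles the same invocation as the examples above. `build_command` is a hypothetical helper, the cluster name is a placeholder you supply, and the trailing `app` argument is copied from the examples.

```python
import shlex
from typing import List, Optional


def build_command(
    expected_count: int,
    cluster: str,
    sink: Optional[str] = None,
    sink_opts: Optional[List[str]] = None,
    log_level: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble a check-hca argument list using only the documented options."""
    cmd = ["health_checks", "check-hca", "--expected-count", str(expected_count)]
    if log_level:
        cmd += ["--log-level", log_level]
    if verbose:
        cmd.append("--verbose-out")
    if sink:
        cmd += ["--sink", sink]
    for opt in sink_opts or []:
        cmd += ["--sink-opts", opt]
    # Positional arguments, in the order shown in the usage examples.
    cmd += [cluster, "app"]
    return cmd


if __name__ == "__main__":
    argv = build_command(
        8, "my-cluster", sink="stdout", log_level="DEBUG", verbose=True
    )
    # shlex.join renders a copy-pasteable, correctly quoted shell command.
    print(shlex.join(argv))
```

The resulting list can be passed directly to `subprocess.run`, whose return code then corresponds to the OK, WARN, and CRITICAL values listed under Exit Conditions.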