check_pci

Overview

Validates PCI device presence and link integrity against a hardware manifest. Ensures PCIe devices (GPUs, NICs, NVMe drives, etc.) are properly seated, detected, and operating at expected speeds and widths.

Purpose

Detects PCIe hardware issues including:

  • Missing devices (unseated cards, hardware failure)
  • Degraded PCIe links (reduced speed or width)
  • Topology changes affecting device enumeration
  • Hardware mismatches vs. expected configuration

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --manifest_file | String | /etc/manifest.json | Path to hardware manifest file |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Manifest File Format

The manifest file (default: /etc/manifest.json) describes the expected hardware configuration. Each entry under "pci" is keyed by the device's PCI address:

{
"pci": {
"0000:17:00.0": {
"type": "GPU",
"dev": "NVIDIA A100",
"zone": "GPU",
"slot": "GPU0",
"link_speed": ["16 GT/s PCIe", "8 GT/s PCIe"],
"link_width": 16,
"topology_critical": false
},
"0000:65:00.0": {
"type": "NIC",
"dev": "Mellanox ConnectX-6",
"zone": "Network",
"slot": "NIC0",
"link_speed": ["16 GT/s PCIe"],
"link_width": 16,
"topology_critical": true
}
}
}
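
For scripts that need typed access to a manifest like the one above, the following is a minimal Python sketch, not the tool's own loader. The names PciManifestEntry and parse_pci_manifest are hypothetical, and defaulting topology_critical to False when the key is absent is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PciManifestEntry:
    """One expected device; the manifest keys it by PCI address."""

    dev: str
    zone: str
    slot: str
    link_speed: List[str]
    link_width: int
    type: Optional[str] = None       # optional field
    topology_critical: bool = False  # optional field; default is an assumption


def parse_pci_manifest(text: str) -> Dict[str, PciManifestEntry]:
    """Parse manifest JSON text into entries keyed by PCI address."""
    raw = json.loads(text)
    # A manifest without a "pci" section yields an empty mapping.
    return {
        address: PciManifestEntry(**fields)
        for address, fields in raw.get("pci", {}).items()
    }
```

An unknown key in an entry raises TypeError here, which surfaces manifest typos early; a more lenient loader could ignore extra keys instead.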

Manifest Fields

| Field | Type | Description |
|---|---|---|
| type | String (optional) | Device category (GPU, NIC, NVMe, etc.) |
| dev | String | Device model/name |
| zone | String | Logical grouping (GPU, Network, Storage) |
| slot | String | Human-readable identifier |
| link_speed | List[String] | Acceptable PCIe link speeds, written as sysfs reports them, e.g. "8 GT/s PCIe" (Gen3), "16 GT/s PCIe" (Gen4) |
| link_width | Integer | Expected number of PCIe lanes |
| topology_critical | Boolean (optional) | If this device is missing, stop checking the remaining devices (its absence affects enumeration) |
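
The effect of topology_critical can be illustrated with a short Python sketch. It is an assumption about how such a check might be structured, not the tool's actual code; find_missing and the is_present callback are hypothetical names, and devices are assumed to be visited in manifest order.

```python
from typing import Callable, Dict, List


def find_missing(
    manifest_pci: Dict[str, dict],
    is_present: Callable[[str], bool],
) -> List[str]:
    """Return missing PCI addresses, stopping at a topology-critical one."""
    missing: List[str] = []
    for address, entry in manifest_pci.items():
        if is_present(address):
            continue
        missing.append(address)
        if entry.get("topology_critical", False):
            # Enumeration of later devices may have shifted, so results
            # past this point would not be trustworthy.
            break
    return missing
```

Stopping early keeps the report focused on the root cause rather than listing every device whose address moved as a side effect.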

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All devices present and healthy |
| WARN (1) | Manifest file not found |
| WARN (1) | Exception during check |
| CRITICAL (2) | Device missing |
| CRITICAL (2) | Degraded link |
| CRITICAL (2) | Topology-critical device missing |
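
The mapping from check outcome to exit code can be sketched in Python as follows. This is only an illustration of the conditions listed above; ExitCode, overall_exit_code, and their parameters are hypothetical names, and the precedence (killswitch first, then missing manifest, then device problems) is an assumption.

```python
from enum import IntEnum
from typing import Sequence


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def overall_exit_code(
    manifest_found: bool,
    problems: Sequence[str],
    killswitch_active: bool = False,
) -> ExitCode:
    """Collapse the outcome of one run into a single exit code."""
    if killswitch_active:
        return ExitCode.OK        # feature flag disabled: nothing is checked
    if not manifest_found:
        return ExitCode.WARN      # cannot validate without a manifest
    if problems:
        return ExitCode.CRITICAL  # missing device or degraded link
    return ExitCode.OK
```

A caller would additionally wrap the check in a try/except and return ExitCode.WARN when an unexpected exception is raised.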

Sysfs Paths

PCIe link information is read from Linux sysfs:

/sys/class/pci_bus/{domain:bus}/device/{domain:bus:device.function}/
├── current_link_speed # e.g., "16 GT/s PCIe" (Gen4), "8 GT/s PCIe" (Gen3)
└── current_link_width # e.g., "16", "8", "4", "1"

Example: For the device at PCI address 0000:17:00.0:

  • Domain and bus: 0000:17
  • Device path: /sys/class/pci_bus/0000:17/device/0000:17:00.0/
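
The path construction and link comparison can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the tool's implementation: device_path, read_link, and link_problems are hypothetical names, a missing sysfs file is treated as a missing device, and any width mismatch (not only a reduction) is reported.

```python
from pathlib import Path
from typing import List, Optional, Tuple

SYSFS_PCI_BUS = Path("/sys/class/pci_bus")


def device_path(address: str, root: Path = SYSFS_PCI_BUS) -> Path:
    """Map 0000:17:00.0 to <root>/0000:17/device/0000:17:00.0."""
    domain, bus, _device_function = address.split(":")
    return root / f"{domain}:{bus}" / "device" / address


def read_link(address: str, root: Path = SYSFS_PCI_BUS) -> Optional[Tuple[str, int]]:
    """Return (speed, width) for a device, or None if it is not present."""
    path = device_path(address, root)
    try:
        speed = (path / "current_link_speed").read_text().strip()
        width = int((path / "current_link_width").read_text().strip())
    except FileNotFoundError:
        return None
    return speed, width


def link_problems(
    address: str,
    expected_speeds: List[str],
    expected_width: int,
    root: Path = SYSFS_PCI_BUS,
) -> List[str]:
    """Compare the live link against the manifest's expectations."""
    link = read_link(address, root)
    if link is None:
        return [f"{address}: device missing"]
    speed, width = link
    problems = []
    if speed not in expected_speeds:
        problems.append(f"{address}: link speed {speed!r} not in {expected_speeds}")
    if width != expected_width:
        problems.append(f"{address}: link width {width}, expected {expected_width}")
    return problems
```

The root parameter exists so the functions can be pointed at a fake directory tree in tests instead of the real /sys.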

Usage Examples

Basic Validation

health_checks check-pci [CLUSTER] app

Custom Manifest Location

health_checks check-pci --manifest_file /opt/config/hw_manifest.json [CLUSTER] app

With Telemetry

health_checks check-pci \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
--verbose-out \
[CLUSTER] \
app

Debug Mode

health_checks check-pci \
--log-level DEBUG \
--verbose-out \
[CLUSTER] \
app