check-blockdev

Overview

Validates NVMe block device health by comparing SMART data against manifest specifications.

Expects a manifest file describing the hardware, such as DGX_A100.json.

Requirements

System Requirements

  • smartctl from smartmontools package
  • Access to /sys/block/ filesystem
  • Access to manifest file

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --manifest_file | String | /etc/manifest.json | Path to hardware manifest file |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Validation Logic

Runs the following checks for each NVMe device listed in the manifest:

1. Health Log Status

  • Reads SMART health information log
  • Verifies log is available and readable
  • CRITICAL if health log is invalid
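A minimal sketch of the health log check, assuming the log is read through smartctl's JSON output (`--json`, available in smartmontools 7.0 and later) and that the log appears under the `nvme_smart_health_information_log` key. The function names are hypothetical, and the actual check may parse smartctl output differently.

```python
import json
import subprocess
from typing import Optional

# Key under which smartctl's JSON output reports the NVMe health log.
LOG_KEY = "nvme_smart_health_information_log"


def extract_health_log(smartctl_json: str) -> Optional[dict]:
    """Return the NVMe health log, or None if it is missing or unreadable."""
    try:
        data = json.loads(smartctl_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    log = data.get(LOG_KEY)
    # An absent, empty, or non-dict log is treated as invalid.
    return log if isinstance(log, dict) and log else None


def read_health_log(device: str, timeout: int = 300) -> Optional[dict]:
    """Run smartctl against `device`; None signals an invalid health log."""
    try:
        proc = subprocess.run(
            ["smartctl", "--json", "-a", device],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # smartctl uses non-zero exit bits for warnings too
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return extract_health_log(proc.stdout)
```

A caller would map a `None` result to CRITICAL, matching the rule above.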

2. Device Size Validation

  • Compares actual device size against manifest specification
  • Reads the size from sysfs: /sys/block/<device>/size (reported in 512-byte sectors)
  • CRITICAL if sizes don't match
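The size comparison can be sketched as follows. The sysfs `size` file counts 512-byte sectors regardless of the device's logical block size, so the value is multiplied by 512 before it is compared with the manifest's `size_bytes`. The function names and the `sysfs_root` parameter are illustrative assumptions, not part of the tool.

```python
from pathlib import Path

# sysfs reports block device size in 512-byte sectors.
SECTOR_BYTES = 512


def device_size_bytes(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the size in bytes of a device such as /dev/nvme0n1."""
    name = Path(device).name  # "/dev/nvme0n1" -> "nvme0n1"
    sectors = int((Path(sysfs_root) / name / "size").read_text().strip())
    return sectors * SECTOR_BYTES


def size_matches(device: str, expected_bytes: int,
                 sysfs_root: str = "/sys/block") -> bool:
    """True when the sysfs size equals the manifest's size_bytes exactly."""
    return device_size_bytes(device, sysfs_root) == expected_bytes
```

For the 3840755982336-byte device in the manifest example, sysfs would report 7501476528 sectors.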

3. Lifetime Usage

  • Checks "Percentage Used" SMART attribute
  • CRITICAL if > 100% (lifetime exceeded)
  • WARN if > 80% (approaching end of life)
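The two thresholds translate directly into a small classifier. This is a sketch with hypothetical names; the numeric codes follow the Exit Conditions table (OK 0, WARN 1, CRITICAL 2).

```python
OK, WARN, CRITICAL = 0, 1, 2


def classify_percentage_used(percentage_used: int) -> int:
    """Map the "Percentage Used" value to a severity code."""
    if percentage_used > 100:
        return CRITICAL  # rated lifetime exceeded
    if percentage_used > 80:
        return WARN  # approaching end of life
    return OK
```

Both comparisons are strict, so exactly 80% is still OK and exactly 100% is still WARN.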

4. Spare Space Threshold

  • Monitors "Available Spare" vs "Available Spare Threshold"
  • CRITICAL if spare space < threshold
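As a sketch, the spare check is a single comparison between two health log fields. The key names `available_spare` and `available_spare_threshold` are the ones smartctl's JSON output uses; the function name, and the assumption that the check consumes that JSON form, are illustrative.

```python
from typing import Any, Mapping


def spare_is_low(health_log: Mapping[str, Any]) -> bool:
    """True when Available Spare has dropped below its threshold."""
    spare = int(health_log["available_spare"])
    threshold = int(health_log["available_spare_threshold"])
    return spare < threshold
```

Spare equal to the threshold does not trigger CRITICAL under the strict `<` stated above.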

5. SMART Data Consistency

  • Validates read/write statistics are consistent
  • Checks data units read/written
  • CRITICAL if inconsistent or negative values
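The exact definition of "consistent" is not spelled out here, so the sketch below makes a narrow assumption: each data-unit counter must be present, must be an integer, and must not be negative. The key names follow smartctl's JSON output; the function name is hypothetical.

```python
from typing import Any, List, Mapping

COUNTER_KEYS = ("data_units_read", "data_units_written")


def counter_problems(health_log: Mapping[str, Any]) -> List[str]:
    """Return one message per counter that is missing, non-integer, or negative."""
    problems = []
    for key in COUNTER_KEYS:
        value = health_log.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key}: missing or not an integer ({value!r})")
        elif value < 0:
            problems.append(f"{key}: negative value {value}")
    return problems
```

An empty list means the counters passed; any entry would map to CRITICAL.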

6. Device Identification

  • Verifies serial number and model information
  • Cross-references with manifest expectations
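A sketch of the cross-reference, assuming the observed model and serial number come from smartctl (its JSON output exposes them as `model_name` and `serial_number`) and that a match means string equality after trimming whitespace. The function name and the matching rule are assumptions; the severity of a mismatch is not stated in this document.

```python
from typing import Dict, List


def identity_mismatches(expected: Dict[str, str],
                        model_name: str,
                        serial_number: str) -> List[str]:
    """Compare a manifest entry's model and serial with observed values."""
    observed = {"model": model_name, "serial": serial_number}
    mismatches = []
    for field in ("model", "serial"):
        want = expected[field].strip()
        got = observed[field].strip()
        if want != got:
            mismatches.append(f"{field}: expected {want!r}, found {got!r}")
    return mismatches
```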

Manifest File Format

Expected JSON structure:

{
  "blockdev": [
    {
      "device": "/dev/nvme0n1",
      "size_bytes": 3840755982336,
      "model": "Samsung SSD Model",
      "serial": "S12345678"
    }
  ]
}
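Loading this structure could look like the sketch below. The `BlockDevSpec` class and `parse_manifest` function are hypothetical names; only the JSON keys come from the format above. A real manifest such as DGX_A100.json may carry additional sections, which this sketch ignores.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockDevSpec:
    """One expected NVMe device from the manifest's "blockdev" list."""
    device: str
    size_bytes: int
    model: str
    serial: str


def parse_manifest(text: str) -> List[BlockDevSpec]:
    """Parse manifest JSON text; a missing key raises KeyError."""
    data = json.loads(text)
    return [
        BlockDevSpec(
            device=entry["device"],
            size_bytes=int(entry["size_bytes"]),
            model=entry["model"],
            serial=entry["serial"],
        )
        for entry in data.get("blockdev", [])
    ]
```

Failing loudly on a missing key is a design choice of this sketch: a malformed manifest would otherwise let a device silently skip its checks.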

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All health checks pass |
| WARN (1) | Lifetime usage > 80%, command failures |
| CRITICAL (2) | Health log invalid, size mismatch, lifetime > 100%, spare space low, bad SMART data |
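One plausible way to combine per-check results into a single exit code is to report the most severe one. That aggregation rule is an assumption; this document lists the conditions but not how multiple results are merged. The names below are hypothetical.

```python
from enum import IntEnum
from typing import Iterable


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def overall_exit_code(results: Iterable[ExitCode]) -> ExitCode:
    """Return the worst severity seen; no results at all counts as OK."""
    return max(results, default=ExitCode.OK)
```

For example, a device with one WARN (lifetime usage above 80%) and one CRITICAL (size mismatch) would exit with CRITICAL (2).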

Usage Examples

Default manifest check

health_checks check-blockdev [CLUSTER] app

Custom manifest location

health_checks check-blockdev --manifest_file /etc/custom_manifest.json [CLUSTER] app

With timeout

health_checks check-blockdev --manifest_file /etc/manifest.json --timeout 60 [CLUSTER] app