vbios_mismatch

Overview

Verifies that every GPU on the host reports the same VBIOS version, as returned by nvmlDeviceGetVbiosVersion(). If --gpu_vbios is not specified, the check takes the first GPU's version as the expected value and validates that all other GPUs match it.
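
The comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the check's actual implementation: the function names `find_vbios_mismatches` and `read_vbios_versions` are hypothetical, and the NVML reader assumes the `pynvml` bindings are installed.

```python
from typing import List, Sequence, Tuple


def find_vbios_mismatches(
    versions: Sequence[str], expected: str = ""
) -> Tuple[str, List[int]]:
    """Return the reference version and the indices of GPUs that differ from it.

    Hypothetical helper, written to mirror the documented behaviour.
    """
    if not versions:
        raise ValueError("no VBIOS versions were retrieved")
    # An empty expected value mirrors the documented default of --gpu_vbios:
    # fall back to the version reported by the first GPU.
    reference = expected or versions[0]
    mismatched = [i for i, v in enumerate(versions) if v != reference]
    return reference, mismatched


def read_vbios_versions() -> List[str]:
    """Query every GPU through NVML (assumes pynvml and an NVIDIA driver)."""
    import pynvml  # imported lazily so the comparison works without a GPU

    pynvml.nvmlInit()
    try:
        versions = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            version = pynvml.nvmlDeviceGetVbiosVersion(handle)
            # Older pynvml releases return bytes rather than str.
            if isinstance(version, bytes):
                version = version.decode()
            versions.append(version)
        return versions
    finally:
        pynvml.nvmlShutdown()
```

On a GPU host, `find_vbios_mismatches(read_vbios_versions())` would return the first GPU's version together with the indices of any GPUs that disagree with it. An empty list corresponds to the consistent case.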

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --gpu_vbios | String | "" | Expected VBIOS version (auto-detect from first GPU if empty) |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All GPUs have consistent VBIOS versions |
| CRITICAL (2) | VBIOS version mismatch detected |
| UNKNOWN (3) | Unable to retrieve VBIOS information |
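
A wrapper script that invokes the check may want to translate the process exit status back into one of the labels above. The following is a minimal sketch under stated assumptions: `run_check` and `EXIT_MEANINGS` are hypothetical names, the outer 300-second timeout simply mirrors the documented --timeout default, and any exit code not listed in the table is reported as unexpected rather than guessed at.

```python
import subprocess
import sys
from typing import Sequence

# Exit codes taken from the Exit Conditions table; nothing else is assumed.
EXIT_MEANINGS = {
    0: "OK",
    2: "CRITICAL",
    3: "UNKNOWN",
}


def run_check(command: Sequence[str], timeout: float = 300.0) -> str:
    """Run a command and map its exit status to a documented label."""
    # check=False: a non-zero exit is an expected outcome here, not an error.
    completed = subprocess.run(list(command), timeout=timeout, check=False)
    return EXIT_MEANINGS.get(
        completed.returncode, f"UNEXPECTED ({completed.returncode})"
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the full health_checks command line as arguments to this script.
    print(run_check(sys.argv[1:]))
```

Note that OK (0) covers two distinct conditions, so a zero exit status alone does not show whether the comparison ran or the killswitch skipped it.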

Usage Examples

vbios_mismatch - Auto-Detect Check

health_checks check-nvidia-smi \
-c vbios_mismatch \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

vbios_mismatch - Specify Expected Version

health_checks check-nvidia-smi \
-c vbios_mismatch \
--gpu_vbios "96.00.5E.00.01" \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app