
check-module

Overview

Verifies that the specified kernel modules are loaded on the system. Use this check to confirm that required drivers (e.g., NVIDIA GPU drivers, InfiniBand modules) are available before a workload runs.

Requirements

  • Linux

Commands Used

lsmod | grep {module_name} | wc -l
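
Because the pipeline above uses a plain grep, it counts every lsmod line that contains the module name anywhere, including the "Used by" column of other modules. The following is a minimal Python sketch of that counting, assuming the module name contains no regex metacharacters. The sample lsmod text and the function name count_module_lines are illustrative and not part of the tool.

```python
# Illustrative lsmod output; module sizes and usage counts are made up.
SAMPLE_LSMOD = """\
Module                  Size  Used by
nvidia_uvm           1540096  0
nvidia_drm             77824  2
nvidia_modeset       1306624  1 nvidia_drm
nvidia              56717312  19 nvidia_uvm,nvidia_modeset
ib_core               434176  0
"""


def count_module_lines(lsmod_output: str, module_name: str) -> int:
    """Count lines containing module_name, like `grep {module_name} | wc -l`."""
    return sum(1 for line in lsmod_output.splitlines() if module_name in line)


print(count_module_lines(SAMPLE_LSMOD, "nvidia"))      # 4: every nvidia* row matches
print(count_module_lines(SAMPLE_LSMOD, "nvidia_uvm"))  # 2: its own row plus nvidia's "Used by"
print(count_module_lines(SAMPLE_LSMOD, "ib_core"))     # 1
```

The substring behaviour is why a count greater than 1 can be a reasonable expectation for a single module name.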

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --module / -m | String | Required | Module name(s) to check (can be specified multiple times) |
| --mod_count | Integer | 1 | Expected occurrence count in lsmod output |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |
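
To illustrate how a repeatable --module option and the documented defaults fit together, here is a small sketch using Python's standard argparse. It is not the tool's actual parser, which this page does not show. The function name build_parser is hypothetical, and only three of the options are modelled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring three of the documented options."""
    parser = argparse.ArgumentParser(prog="check-module")
    # action="append" collects each occurrence of --module / -m into a list.
    parser.add_argument("--module", "-m", action="append", required=True,
                        help="Module name to check (repeatable)")
    parser.add_argument("--mod_count", type=int, default=1,
                        help="Expected occurrence count in lsmod output")
    parser.add_argument("--timeout", type=int, default=300,
                        help="Command execution timeout in seconds")
    return parser


args = build_parser().parse_args(["-m", "nvidia", "--module", "nvidia_uvm"])
print(args.module)     # ['nvidia', 'nvidia_uvm']
print(args.mod_count)  # 1 (default)
print(args.timeout)    # 300 (default)
```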

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | All modules appear >= expected count |
| WARN (1) | Command execution failed |
| CRITICAL (2) | Any module count < expected |
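
The exit conditions above can be read as a small decision function. The sketch below is one possible reading, not the tool's implementation: the function name exit_code, the enabled flag standing in for the killswitch, and the use of None to signal a failed command are all assumptions. How the real check behaves when only some commands fail is not stated here.

```python
from typing import Dict, Optional

OK, WARN, CRITICAL = 0, 1, 2


def exit_code(counts: Optional[Dict[str, int]],
              expected: int = 1,
              enabled: bool = True) -> int:
    """Map per-module line counts to an exit code (hypothetical helper).

    counts is None when the lsmod command could not be executed.
    """
    if not enabled:
        return OK          # feature flag disabled (killswitch active)
    if counts is None:
        return WARN        # command execution failed
    if any(count < expected for count in counts.values()):
        return CRITICAL    # at least one module is below the expected count
    return OK              # all modules appear >= expected count


print(exit_code({"nvidia": 4, "ib_core": 1}))  # 0
print(exit_code({"nvidia": 0}))                # 2
print(exit_code(None))                         # 1
```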

Usage Examples

Single Module Check

# Check if nvidia module is loaded
health_checks check-node check-module \
--module nvidia \
[CLUSTER] \
app

Multiple Modules

# Check multiple GPU-related modules
health_checks check-node check-module \
--module nvidia \
--module nvidia_uvm \
--module nvidia_drm \
[CLUSTER] \
app

With Custom Count Threshold

# Expect at least 8 lsmod lines matching ib_core
health_checks check-node check-module \
--module ib_core \
--mod_count 8 \
[CLUSTER] \
app

Multiple Modules with Different Counts

# Check modules with specific counts
health_checks check-node check-module \
--module nvidia --mod_count 8 \
--module ib_core --mod_count 1 \
[CLUSTER] \
app

With Telemetry

health_checks check-node check-module \
--module nvidia \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
--verbose-out \
[CLUSTER] \
app

Debug Mode

health_checks check-node check-module \
--module nvidia \
--log-level DEBUG \
--verbose-out \
[CLUSTER] \
app