running_procs

Overview

Checks for processes occupying GPUs using nvmlDeviceGetComputeRunningProcesses(). Verifies that GPUs are idle and available for new workloads.
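To make the NVML query concrete, here is a minimal Python sketch of how compute processes could be listed per GPU. It is not the check's actual implementation. The function name `compute_pids_per_gpu` is hypothetical, and the `nvml` argument is assumed to be a module exposing the standard `pynvml` bindings (`nvmlInit`, `nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetComputeRunningProcesses`, `nvmlShutdown`).

```python
from typing import Any, Dict, List


def compute_pids_per_gpu(nvml: Any) -> Dict[int, List[int]]:
    """Return {gpu_index: [pid, ...]} for compute processes reported by NVML.

    `nvml` is assumed to be the pynvml module (or an object with the same
    functions); passing it in keeps the sketch usable without a GPU.
    """
    nvml.nvmlInit()
    try:
        pids_by_gpu: Dict[int, List[int]] = {}
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            # Each returned entry describes one compute process on this GPU.
            procs = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
            pids_by_gpu[index] = [proc.pid for proc in procs]
        return pids_by_gpu
    finally:
        # Always release NVML, even if a query raised an error.
        nvml.nvmlShutdown()
```

On a host with the NVIDIA driver and the `pynvml` bindings installed, calling `compute_pids_per_gpu(pynvml)` after `import pynvml` would return an empty list for every idle GPU.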

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No active processes found |
| OK (0) | Only zombie PIDs detected |
| CRITICAL (2) | Real processes found |
| UNKNOWN (3) | NVML query failed |
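The exit conditions above can be read as a small decision function. The sketch below is illustrative only and is not taken from the check's source. The names `ExitCode`, `decide_exit_code`, and the `is_zombie` callback are hypothetical. How the check identifies a zombie PID is not stated here, so that logic is left to the caller. The order of precedence (killswitch first, then query failure) is also an assumption.

```python
from enum import IntEnum
from typing import Callable, Iterable, Optional


class ExitCode(IntEnum):
    OK = 0
    CRITICAL = 2
    UNKNOWN = 3


def decide_exit_code(
    pids: Optional[Iterable[int]],
    is_zombie: Callable[[int], bool],
    killswitch_active: bool = False,
) -> ExitCode:
    """Map the documented exit conditions onto an exit code.

    `pids` is None when the NVML query failed (an assumed convention).
    `is_zombie` is a caller-supplied, hypothetical predicate.
    """
    if killswitch_active:
        # Feature flag disabled: the check reports OK without evaluating.
        return ExitCode.OK
    if pids is None:
        # NVML query failed, so GPU state is unknown.
        return ExitCode.UNKNOWN
    real_pids = [pid for pid in pids if not is_zombie(pid)]
    # Any non-zombie process means the GPU is not idle.
    return ExitCode.CRITICAL if real_pids else ExitCode.OK
```

For example, `decide_exit_code([4242], is_zombie=lambda pid: False)` yields `ExitCode.CRITICAL`, while an empty PID list or a list containing only zombie PIDs yields `ExitCode.OK`.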

Usage Examples

running_procs - Basic Check

health_checks check-nvidia-smi \
-c running_procs \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app