running_procs_and_kill

Overview

An advanced running-process check with retry logic and an optional force-kill for stuck processes. It repeats the check up to a configurable number of times, waiting a configurable interval between attempts, before determining GPU availability status.

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --running_procs_retry_count | Integer | 3 | Number of retry attempts |
| --running_procs_interval | Integer | 3 | Seconds between retry attempts |
| --running_procs_force_kill | Flag | False | Force-kill processes if detected |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | No processes on first attempt |
| WARN (1) | Processes cleared after retries |
| WARN (1) | Processes killed successfully |
| CRITICAL (2) | Processes remain after retries |
| CRITICAL (2) | Force-kill failed |
| UNKNOWN (3) | NVML query failed |
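
The decision flow implied by the exit conditions above can be sketched in shell. This is an illustration only, not the check's actual implementation. The function names `list_gpu_pids`, `kill_pids`, and `run_check` are hypothetical, and the `nvidia-smi` query stands in for the NVML call the real check makes. Two points are assumptions rather than documented behavior: that the retry count means additional attempts after the first one, and that force-kill is tried only once the retries are exhausted. The killswitch case is omitted.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the retry / force-kill flow; not the real implementation.

OK=0 WARN=1 CRITICAL=2 UNKNOWN=3

# Hypothetical probe: prints one PID per line for each compute process on the GPUs
# and returns non-zero if the query itself fails.
list_gpu_pids() {
  nvidia-smi --query-compute-apps=pid --format=csv,noheader
}

# Hypothetical force-kill step: succeeds only if every PID could be signalled.
kill_pids() {
  kill -9 "$@"
}

# run_check RETRY_COUNT INTERVAL_SECONDS FORCE_KILL(true|false)
run_check() {
  local retries=$1 interval=$2 force_kill=$3
  local pids attempt=1

  # First attempt: a clean result here is the only path to OK.
  pids=$(list_gpu_pids) || return "$UNKNOWN"
  [ -z "$pids" ] && return "$OK"

  # Assumption: RETRY_COUNT counts the attempts made after the first one.
  while [ "$attempt" -le "$retries" ]; do
    sleep "$interval"
    pids=$(list_gpu_pids) || return "$UNKNOWN"
    [ -z "$pids" ] && return "$WARN"   # cleared, but only after retrying
    attempt=$((attempt + 1))
  done

  # Assumption: force-kill is attempted only after the retries are exhausted.
  # $pids is left unquoted on purpose so each PID becomes its own argument.
  if [ "$force_kill" = "true" ] && kill_pids $pids; then
    return "$WARN"
  fi
  return "$CRITICAL"
}
```

Calling `run_check 3 3 false` mirrors the default retry count and interval from the options table. The function's return status then carries one of the four codes.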

Usage Examples

running_procs_and_kill - Basic with Retry

health_checks check-nvidia-smi \
-c running_procs_and_kill \
--running_procs_retry_count 3 \
--running_procs_interval 3 \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

running_procs_and_kill - Force Kill

health_checks check-nvidia-smi \
-c running_procs_and_kill \
--running_procs_retry_count 5 \
--running_procs_interval 2 \
--running_procs_force_kill \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app
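
A calling script may want to act on the result of either invocation above. The sketch below assumes that the command's process exit status matches the numeric codes in the Exit Conditions table. The helper names `status_name` and `report_check` are hypothetical and not part of `health_checks`.

```shell
#!/usr/bin/env bash
# Illustrative wrapper; status_name and report_check are hypothetical helpers.

# Translate a numeric exit code into the status name used in the Exit Conditions table.
status_name() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARN" ;;
    2) echo "CRITICAL" ;;
    3) echo "UNKNOWN" ;;
    *) echo "UNEXPECTED($1)" ;;
  esac
}

# Run the given command, print the status name, and pass the exit code through.
report_check() {
  local rc=0
  "$@" || rc=$?
  echo "running_procs_and_kill: $(status_name "$rc")"
  return "$rc"
}

# Example (replace [CLUSTER] with a real cluster name before running):
#   report_check health_checks check-nvidia-smi \
#     -c running_procs_and_kill \
#     --running_procs_retry_count 3 \
#     --running_procs_interval 3 \
#     [CLUSTER] \
#     app
```

Because `report_check` returns the original code, it can be dropped in front of an existing invocation without changing how a scheduler or monitoring agent sees the result.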