node-slurm-state

Overview

Checks if the node is in a Slurm state that allows it to accept jobs. Validates individual node health and availability for workload execution.
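The idea of mapping a Slurm node state to a check result can be sketched as below. This is only an illustration, not the check's actual implementation: the function name `classify_slurm_state` is hypothetical, and the lists of "good" and "critical" states are assumptions, since the source does not say which states the check accepts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a Slurm node state string.
# ASSUMPTION: which states count as "good" or "critical" is illustrative only;
# the real node-slurm-state check may use different lists.
classify_slurm_state() {
  local state
  # Normalize to lowercase.
  state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  # Drop everything from the first non-letter onward (e.g. "idle*" -> "idle").
  state="${state%%[!a-z]*}"
  case "$state" in
    idle|mixed|allocated|completing) echo "good" ;;
    down|drained|draining|fail|failing) echo "critical" ;;
    *) echo "undefined" ;;
  esac
}

# Example use on a node with Slurm installed (not executed here):
#   classify_slurm_state "$(sinfo -h -n "$(hostname -s)" -o '%T')"
```

Under these assumptions, "good" would correspond to the OK result and both "critical" and "undefined" to the WARN result.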

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition |
|---|---|
| OK (0) | Feature flag disabled (killswitch active) |
| OK (0) | Node in good state and can accept jobs |
| WARN (1) | Node in critical state, undefined state, or command failed |
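A calling script can branch on these exit codes. The following is a minimal sketch, assuming only the two codes listed above; the helper names `describe_exit_code` and `run_and_report` are hypothetical and not part of the tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a documented exit code into its label.
describe_exit_code() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARN" ;;
    *) echo "UNKNOWN" ;;   # any code not listed in the table above
  esac
}

# Hypothetical wrapper: run a command, print its label, and pass the code on.
run_and_report() {
  local rc=0
  "$@" || rc=$?
  echo "$(describe_exit_code "$rc") (exit code $rc)"
  return "$rc"
}

# Example use (not executed here; [CLUSTER] is the placeholder from this page):
#   run_and_report health_checks check-service node-slurm-state [CLUSTER] app
```

Because OK (0) is also returned when the killswitch is active, a zero exit code alone does not confirm that the node state was actually inspected.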

Usage Examples

node-slurm-state - Basic Check

health_checks check-service node-slurm-state \
[CLUSTER] \
app

node-slurm-state - With Telemetry

health_checks check-service node-slurm-state \
--sink otel \
--sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
[CLUSTER] \
app

node-slurm-state - Debug Mode

health_checks check-service node-slurm-state \
--log-level DEBUG \
--verbose-out \
[CLUSTER] \
app

node-slurm-state - With Timeout

health_checks check-service node-slurm-state \
--timeout 30 \
[CLUSTER] \
app