slurmctld-count

Overview

Verifies that at least a minimum number of Slurm controller daemons (slurmctld) is reachable from the node. This confirms that enough controllers are available for cluster management operations.
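As a rough illustration of what "reachable" could mean in practice, the sketch below counts controllers reported as UP in `scontrol ping`-style output. The source does not say which command the check runs, so the use of `scontrol ping`, the sample text, and the names `count_reachable` and `SAMPLE` are assumptions made for illustration only.

```python
import re

# Assumed line format, modeled on `scontrol ping` output, e.g.
#   "Slurmctld(primary) at ctl1 is UP"
_STATUS_LINE = re.compile(
    r"^Slurmctld\((?P<role>[^)]+)\) at (?P<host>\S+) is (?P<state>UP|DOWN)\s*$"
)


def count_reachable(ping_output: str) -> int:
    """Count controllers whose status line reports UP (hypothetical helper)."""
    count = 0
    for line in ping_output.splitlines():
        match = _STATUS_LINE.match(line.strip())
        if match and match.group("state") == "UP":
            count += 1
    return count


# Illustrative sample, not captured from a real cluster.
SAMPLE = """\
Slurmctld(primary) at ctl1 is UP
Slurmctld(backup) at ctl2 is DOWN
"""

print(count_reachable(SAMPLE))
```

Lines that do not match the assumed format are ignored rather than counted, so unexpected output yields a lower count instead of a false positive.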

Command-Line Options

| Option | Type | Default | Description |
|---|---|---|---|
| --slurmctld-count | Integer | Required | Minimum number of slurmctld daemons that should be present |
| --timeout | Integer | 300 | Command execution timeout in seconds |
| --sink | String | do_nothing | Telemetry sink destination |
| --sink-opts | Multiple | - | Sink-specific configuration |
| --verbose-out | Flag | False | Display detailed output |
| --log-level | Choice | INFO | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-folder | String | /var/log/fb-monitoring | Log directory |
| --heterogeneous-cluster-v1 | Flag | False | Enable heterogeneous cluster support |

Exit Conditions

| Exit Code | Condition | Message |
|---|---|---|
| OK (0) | Feature flag disabled (killswitch active) | Feature disabled by killswitch |
| OK (0) | Reachable daemon count >= expected count | Sufficient slurmctld daemon count. Expected at least: {n} and found: {n} |
| WARN (1) | Command execution failed or invalid output | Error details with command output |
| CRITICAL (2) | Reachable daemon count < expected count | Insufficient slurmctld daemon count. Expected at least: {n} and found: {n} |
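The exit conditions above can be read as a small decision function. The following is a minimal sketch of that mapping, not the tool's actual implementation: the names `ExitCode` and `evaluate`, the order in which conditions are checked, and the exact wording of the WARN message are assumptions. The OK and CRITICAL messages follow the table.

```python
from enum import IntEnum
from typing import Optional, Tuple


class ExitCode(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def evaluate(
    expected: int,
    found: Optional[int],
    killswitch: bool = False,
    error: str = "",
) -> Tuple[ExitCode, str]:
    """Map a check result to (exit code, message).

    `found` is None when the command failed or its output could not be parsed.
    """
    if killswitch:
        return ExitCode.OK, "Feature disabled by killswitch"
    if found is None:
        # Hypothetical wording; the table only says "error details with command output".
        return ExitCode.WARN, f"Command failed or returned invalid output: {error}"
    if found >= expected:
        return (
            ExitCode.OK,
            f"Sufficient slurmctld daemon count. Expected at least: {expected} and found: {found}",
        )
    return (
        ExitCode.CRITICAL,
        f"Insufficient slurmctld daemon count. Expected at least: {expected} and found: {found}",
    )
```

For example, `evaluate(2, 1)` returns the CRITICAL code, which is the case for a cluster configured with a primary and a backup controller where only one is reachable.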

Usage Examples

slurmctld-count - Primary Only

    health_checks check-service slurmctld-count \
      --slurmctld-count 1 \
      [CLUSTER] \
      app

slurmctld-count - Primary + Backup

    health_checks check-service slurmctld-count \
      --slurmctld-count 2 \
      [CLUSTER] \
      app

slurmctld-count - With Telemetry

    health_checks check-service slurmctld-count \
      --slurmctld-count 2 \
      --sink otel \
      --sink-opts "log_resource_attributes={'attr_1': 'value1'}" \
      [CLUSTER] \
      app

slurmctld-count - Debug Mode

    health_checks check-service slurmctld-count \
      --slurmctld-count 2 \
      --log-level DEBUG \
      --verbose-out \
      [CLUSTER] \
      app