check_service

check_service validates systemd service status and RPM package versions, and monitors Slurm cluster health. All checks are subcommands of the check_service CLI group.

Available Health Checks

Check                 Purpose
service-status        Verify systemd service status
package-version       Validate RPM package versions
slurmctld-count       Verify minimum slurmctld daemons are reachable
node-slurm-state      Validate node can accept Slurm jobs
cluster-availability  Monitor percentage of nodes in unhealthy states

Quick Start

# Check service status
health_checks check_service service_status --cluster my_cluster --type health_check --service slurmd

# Verify package version
health_checks check_service package_version --cluster my_cluster --type health_check --package slurm --version 21.08.8-1.el8

# Controller daemon count check
health_checks check_service slurmctld_count --cluster my_cluster --type health_check --slurmctld-count 2

# Node state check
health_checks check_service node_slurm_state --cluster my_cluster --type health_check

# Cluster availability check
health_checks check_service cluster_availability --cluster my_cluster --type health_check
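
The Quick Start commands can also be run as one batch. The following is a minimal sketch, not part of the tool itself: it assumes that each check exits with a non-zero status on failure, which should be confirmed for your installation. The variables HEALTH_CHECKS_BIN and CLUSTER and the functions run_check and run_all_checks are hypothetical names introduced for illustration; the subcommands and flags are copied from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: run the Quick Start checks in sequence and count failures.
# Assumption: a failing check exits non-zero (not confirmed by this page).

# Hypothetical variables for this sketch; override them in the environment.
HEALTH_CHECKS_BIN="${HEALTH_CHECKS_BIN:-health_checks}"
CLUSTER="${CLUSTER:-my_cluster}"

run_check() {
  # $1 is the check_service subcommand; any remaining arguments
  # are passed through to the command unchanged.
  subcommand="$1"
  shift
  "$HEALTH_CHECKS_BIN" check_service "$subcommand" \
    --cluster "$CLUSTER" --type health_check "$@"
}

run_all_checks() {
  # Run every check even if an earlier one fails, then report the total.
  failed=0
  run_check service_status --service slurmd || failed=$((failed + 1))
  run_check package_version --package slurm --version 21.08.8-1.el8 || failed=$((failed + 1))
  run_check slurmctld_count --slurmctld-count 2 || failed=$((failed + 1))
  run_check node_slurm_state || failed=$((failed + 1))
  run_check cluster_availability || failed=$((failed + 1))
  echo "checks failed: $failed"
  # Return success only when no check failed.
  [ "$failed" -eq 0 ]
}
```

After sourcing the file, call run_all_checks to execute the batch. Setting HEALTH_CHECKS_BIN=echo prints each command line instead of running it, which is a convenient way to preview what would be executed.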