Adding a New Health Check

GCM Health Checks are designed to be easily extensible. Each check follows the same patterns, so adding a new one is mostly about copying the right structure and plugging in your logic. This guide walks through each step.

For a deep dive into the boilerplate and annotated code examples, see the Deep Dive.

1. Create the check file

Create a new file under gcm/health_checks/checks/. The naming convention is check_<name>.py.

# gcm/health_checks/checks/check_example.py
import logging
import os
import socket
import sys
from collections.abc import Collection
from contextlib import ExitStack
from dataclasses import dataclass
from typing import Optional, Protocol

import click
import gni_lib
from gcm.health_checks.check_utils.output_context_manager import OutputContext
from gcm.health_checks.check_utils.telem import TelemetryContext
from gcm.health_checks.click import common_arguments, telemetry_argument, timeout_argument
from gcm.health_checks.subprocess import handle_subprocess_exception, shell_command, ShellCommandOut
from gcm.health_checks.types import CHECK_TYPE, CheckEnv, ExitCode, LOG_LEVEL
from gcm.monitoring.click import heterogeneous_cluster_v1_option
from gcm.monitoring.features.gen.generated_features_healthchecksfeatures import FeatureValueHealthChecksFeatures
from gcm.monitoring.slurm.derived_cluster import get_derived_cluster
from gcm.monitoring.utils.monitor import init_logger
from gcm.schemas.health_check.health_check_name import HealthCheckName
from typeguard import typechecked

2. Define the Protocol and implementation

Define a Protocol class that describes what external commands your check needs, then implement it in a @dataclass. This enables dependency injection for testing.

class ExampleCheck(CheckEnv, Protocol):
    """Protocol for the example check."""

    def get_example_data(
        self, timeout_secs: int, logger: logging.Logger
    ) -> ShellCommandOut: ...


@dataclass
class ExampleCheckImpl:
    """Production implementation that runs the actual command."""

    cluster: str
    type: str
    log_level: str
    log_folder: str

    def get_example_data(
        self, timeout_secs: int, logger: logging.Logger
    ) -> ShellCommandOut:
        cmd = "your-command --flags"
        logger.info("Running command '%s'", cmd)
        return shell_command(cmd, timeout_secs)

For piped commands (e.g. dmesg | grep ...), use piped_shell_command() from gcm/health_checks/subprocess.py and return PipedShellCommandOut instead.
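The benefit of the Protocol is easiest to see in isolation. The following is a minimal sketch using only the standard library. `CommandOut`, `DataSource`, `RealSource`, `FakeSource` and `summarize` are hypothetical stand-ins, not GCM names, and the real `ShellCommandOut` may carry different fields.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CommandOut:
    """Hypothetical stand-in for ShellCommandOut (fields are assumptions)."""

    returncode: int
    stdout: str


class DataSource(Protocol):
    """Anything with a matching get_data() satisfies this Protocol."""

    def get_data(self, timeout_secs: int) -> CommandOut: ...


@dataclass
class RealSource:
    """Production-style implementation: runs a real subprocess."""

    def get_data(self, timeout_secs: int) -> CommandOut:
        proc = subprocess.run(
            [sys.executable, "-c", "print('all good')"],
            capture_output=True,
            text=True,
            timeout=timeout_secs,
        )
        return CommandOut(proc.returncode, proc.stdout.strip())


@dataclass
class FakeSource:
    """Test double: returns canned output and never spawns a process."""

    canned: CommandOut

    def get_data(self, timeout_secs: int) -> CommandOut:
        return self.canned


def summarize(source: DataSource) -> str:
    """Depends only on the Protocol, so either implementation can be injected."""
    out = source.get_data(timeout_secs=5)
    return f"rc={out.returncode} out={out.stdout}"
```

Because `summarize` is typed against the Protocol rather than a concrete class, a test can pass `FakeSource(CommandOut(1, "boom"))` and exercise the failure path without running any command.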

3. Write the processing function

Separate command output parsing from the Click command itself. This keeps the logic unit-testable without invoking the CLI:

def process_example_output(output: str, error_code: int) -> tuple[ExitCode, str]:
    if error_code > 0:
        return (
            ExitCode.WARN,
            f"Command FAILED. error_code: {error_code} output: {output}",
        )
    # Your check logic here
    if "error_condition" in output:
        return ExitCode.CRITICAL, "Error detected: ..."
    return ExitCode.OK, "No errors found."

Exit codes follow the Nagios Plugin API: OK (0), WARN (1), CRITICAL (2), UNKNOWN (3).
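As a worked illustration of a processing function with concrete logic, the sketch below counts lines that report a link as down. The `ExitCode` enum here is a stand-in that mirrors the four values listed above, not the class from gcm/health_checks/types.py, and `process_link_output`, its `max_down` parameter and the output format are all hypothetical.

```python
from enum import Enum


class ExitCode(Enum):
    """Stand-in mirroring the Nagios values listed above (not the GCM class)."""

    OK = 0
    WARN = 1
    CRITICAL = 2
    UNKNOWN = 3


def process_link_output(
    output: str, error_code: int, max_down: int = 0
) -> tuple[ExitCode, str]:
    """Hypothetical check: count lines that report a link as 'down'."""
    # A failed command is reported as WARN, as in the template above.
    if error_code > 0:
        return (
            ExitCode.WARN,
            f"Command FAILED. error_code: {error_code} output: {output}",
        )
    down = [line for line in output.splitlines() if line.strip().endswith("down")]
    if len(down) > max_down:
        return ExitCode.CRITICAL, f"{len(down)} link(s) down: {', '.join(down)}"
    return ExitCode.OK, "All links up."
```

Keeping the threshold as a parameter means the same function can be tested for both the OK and CRITICAL branches with plain strings, with no subprocess involved.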

4. Implement the Click command

Use the standard decorator stack and boilerplate. For a single check, use @click.command(). For a group with sub-commands (like check_syslogs), use @click.group() and @group.command().

@click.command()
@common_arguments  # --cluster, --type, --log-level, --log-folder
@timeout_argument  # --timeout (default: 300s)
@telemetry_argument  # --sink, --sink-opt, --verbose-out
@heterogeneous_cluster_v1_option
@click.pass_obj  # enables object injection for tests
@typechecked
def check_example(
    obj: Optional[ExampleCheck],
    cluster: str,
    type: CHECK_TYPE,
    log_level: LOG_LEVEL,
    log_folder: str,
    timeout: int,
    sink: str,
    sink_opts: Collection[str],
    verbose_out: bool,
    heterogeneous_cluster_v1: bool,
) -> None:
    """One-line description of what this check does."""
    node: str = socket.gethostname()
    logger, _ = init_logger(
        logger_name=type,
        log_dir=os.path.join(log_folder, type + "_logs"),
        log_name=node + ".log",
        log_level=getattr(logging, log_level),
    )
    try:
        gpu_node_id = gni_lib.get_gpu_node_id()
    except Exception as e:
        gpu_node_id = None
        logger.warning(f"Could not get gpu_node_id, likely not a GPU host: {e}")

    derived_cluster = get_derived_cluster(
        cluster=cluster,
        heterogeneous_cluster_v1=heterogeneous_cluster_v1,
        data={"Node": node},
    )

    if obj is None:
        obj = ExampleCheckImpl(cluster, type, log_level, log_folder)

    exit_code = ExitCode.UNKNOWN
    msg = ""
    with ExitStack() as s:
        s.enter_context(
            TelemetryContext(
                sink=sink, sink_opts=sink_opts, logger=logger,
                cluster=cluster, derived_cluster=derived_cluster,
                type=type, name=HealthCheckName.YOUR_CHECK.value,
                node=node, get_exit_code_msg=lambda: (exit_code, msg),
                gpu_node_id=gpu_node_id,
            )
        )
        s.enter_context(
            OutputContext(type, HealthCheckName.YOUR_CHECK, lambda: (exit_code, msg), verbose_out)
        )

        # Killswitch: allows disabling the check remotely
        ff = FeatureValueHealthChecksFeatures()
        if ff.get_healthchecksfeatures_disable_your_check():
            exit_code = ExitCode.OK
            msg = f"{HealthCheckName.YOUR_CHECK.value} is disabled by killswitch."
            logger.info(msg)
            sys.exit(exit_code.value)

        try:
            result: ShellCommandOut = obj.get_example_data(timeout, logger)
        except Exception as e:
            result = handle_subprocess_exception(e)

        exit_code, msg = process_example_output(result.stdout, result.returncode)
        logger.info(f"exit code {exit_code}: {msg}")

    sys.exit(exit_code.value)
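One detail in the command body is easy to miss: both contexts receive `lambda: (exit_code, msg)` rather than the values themselves, while `exit_code` starts out as `ExitCode.UNKNOWN`. A lambda reads the variables when it is called, not when it is created, so a context that calls it on exit sees whatever the function assigned last, including on the `sys.exit()` path. The sketch below illustrates that mechanism with a hypothetical `report_on_exit` context manager and plain integers for exit codes. It is not the implementation of `TelemetryContext` or `OutputContext`.

```python
import sys
from contextlib import contextmanager
from typing import Callable, Iterator

# Collected (exit code, message) pairs, one per finished check run.
reported: list[tuple[int, str]] = []


@contextmanager
def report_on_exit(get_result: Callable[[], tuple[int, str]]) -> Iterator[None]:
    """Hypothetical reporter: reads the result only when the block exits."""
    try:
        yield
    finally:
        # Runs even when the body raises SystemExit via sys.exit().
        reported.append(get_result())


def run_check(disabled: bool) -> None:
    exit_code, msg = 3, ""  # UNKNOWN until the check decides otherwise
    with report_on_exit(lambda: (exit_code, msg)):
        if disabled:
            exit_code, msg = 0, "disabled by killswitch"
            sys.exit(exit_code)
        exit_code, msg = 2, "Error detected"
    sys.exit(exit_code)


for flag in (True, False):
    try:
        run_check(flag)
    except SystemExit:
        pass  # swallow the exit so both cases can run in one script

print(reported)
```

Neither run reports the initial placeholder `(3, "")`: the callback is evaluated after the assignments, which is why the template can set up the contexts before the result is known.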

5. Register the check

Three places need to be updated:

a) Add the import in gcm/health_checks/checks/__init__.py:

from gcm.health_checks.checks.check_example import check_example

__all__ = [
    ...,
    "check_example",
]

b) Add the command in gcm/health_checks/cli/health_checks.py:

list_of_checks: List[click.core.Command] = [
    ...,
    checks.check_example,
]

c) Add the check name to the HealthCheckName enum in gcm/schemas/health_check/health_check_name.py:

class HealthCheckName(Enum):
    ...
    YOUR_CHECK = "your check"

6. Add the killswitch feature flag

Every check must have a killswitch that allows disabling it remotely without a code deploy. Add a new boolean field to gcm/monitoring/features/feature_definitions/health_checks_features.py:

class HealthChecksFeatures:
    ...
    disable_your_check: bool

After adding the field, regenerate the feature value class and format it:

python bin/generate_features.py
ufmt format gcm

This generates the FeatureValueHealthChecksFeatures class with a get_healthchecksfeatures_disable_your_check() method, which you use in the Click command (see step 4). The killswitch pattern in the command body should be:

ff = FeatureValueHealthChecksFeatures()
if ff.get_healthchecksfeatures_disable_your_check():
    exit_code = ExitCode.OK
    msg = f"{HealthCheckName.YOUR_CHECK.value} is disabled by killswitch."
    logger.info(msg)
    sys.exit(exit_code.value)

Killswitch tests are centralized in gcm/tests/health_checks_tests/test_killswitches.py; add your check there as well.
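The property a killswitch test usually needs to establish is that a disabled check exits with OK and never runs its command. The sketch below shows that property with standard-library stand-ins only. `FakeFlags`, `CountingCommand` and `run_with_killswitch` are hypothetical and do not reflect how test_killswitches.py is actually organized; only the getter name is taken from the text above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeFlags:
    """Hypothetical stand-in for the generated feature-value class."""

    disable_your_check: bool = False

    def get_healthchecksfeatures_disable_your_check(self) -> bool:
        return self.disable_your_check


@dataclass
class CountingCommand:
    """Records how many times the (fake) command was invoked."""

    calls: int = 0

    def __call__(self) -> str:
        self.calls += 1
        return "all good"


def run_with_killswitch(
    flags: FakeFlags, command: Callable[[], str]
) -> tuple[int, str]:
    """Return (exit code, message); skip the command when the killswitch is on."""
    if flags.get_healthchecksfeatures_disable_your_check():
        return 0, "your check is disabled by killswitch."
    return 0, command()


command = CountingCommand()
disabled_result = run_with_killswitch(FakeFlags(disable_your_check=True), command)
calls_while_disabled = command.calls  # stays at zero: the command was skipped
enabled_result = run_with_killswitch(FakeFlags(), command)
```

Counting invocations, rather than only checking the exit code, distinguishes "disabled" from "ran and happened to pass".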

7. Write tests

Create gcm/tests/health_checks_tests/test_check_example.py. Tests follow a consistent pattern:

a) Define a fake implementation that returns pre-built output instead of running real commands:

@dataclass
class FakeExampleCheckImpl:
    example_out: ShellCommandOut

    cluster = "test cluster"
    type = "prolog"
    log_level = "INFO"
    log_folder = "/tmp"

    def get_example_data(
        self, _timeout_secs: int, _logger: logging.Logger
    ) -> ShellCommandOut:
        return self.example_out

b) Create test data using FakeShellCommandOut from gcm.tests.fakes:

from gcm.tests.fakes import FakeShellCommandOut

ok_output = FakeShellCommandOut([], 0, "all good")
error_output = FakeShellCommandOut([], 0, "error_condition found")
cmd_failed = FakeShellCommandOut([], 1, "command error")

c) Use a pytest fixture together with @pytest.mark.parametrize(..., indirect=[...]) so that each scenario builds its own fake:

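# Imports needed by this snippet
from pathlib import Path

import pytest
from click.testing import CliRunner
from gcm.health_checks.checks.check_example import check_example
from gcm.health_checks.types import ExitCode


@pytest.fixture
def example_tester(request: pytest.FixtureRequest) -> FakeExampleCheckImpl:
    # request.param is the FakeShellCommandOut supplied by parametrize
    return FakeExampleCheckImpl(request.param)


@pytest.mark.parametrize(
    "example_tester, expected",
    [
        (ok_output, (ExitCode.OK, "No errors found.")),
        (error_output, (ExitCode.CRITICAL, "Error detected")),
        (cmd_failed, (ExitCode.WARN, "FAILED")),
    ],
    indirect=["example_tester"],
)
def test_check_example(
    tmp_path: Path,
    example_tester: FakeExampleCheckImpl,
    expected: tuple[ExitCode, str],
) -> None:
    # Note: mix_stderr was removed in Click 8.2; drop the argument on newer versions.
    runner = CliRunner(mix_stderr=False)
    result = runner.invoke(
        check_example,
        f"fair_cluster prolog --log-folder={tmp_path} --sink=do_nothing",
        obj=example_tester,
    )
    assert result.exit_code == expected[0].value
    assert expected[1] in result.output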

d) Also test the processing function directly for fast, focused unit tests:

class TestProcessExampleOutput:
    def test_ok(self) -> None:
        exit_code, msg = process_example_output("all good", 0)
        assert exit_code == ExitCode.OK

    def test_error(self) -> None:
        exit_code, msg = process_example_output("error_condition found", 0)
        assert exit_code == ExitCode.CRITICAL

8. Add website documentation

Create a documentation page for your check under website/docs/GCM_Health_Checks/health_checks/. A template is available at check-example.md — copy it and fill in your check's details.

The page should include:

  • Overview: What the check does and what system aspect it monitors
  • Requirements (if any): External tools, packages, or hardware needed
  • Command-Line Options: Table of check-specific options (common options are inherited)
  • Exit Conditions: Table mapping exit codes to specific conditions
  • Usage Examples: Basic and telemetry-enabled invocation examples

For checks that are part of a group (sub-commands), create a folder instead of a single file (e.g. check-syslogs/ with a README.md and one page per sub-command).

9. Verify

Run the full validation suite before submitting your PR:

nox -s format     # ufmt/usort formatting
nox -s lint       # flake8 linting
nox -s tests      # pytest unit tests
nox -s typecheck  # mypy type checking

How to test a new health check

  1. Unit tests: Follow step 7 above. Run with nox -s tests or directly: pytest gcm/tests/health_checks_tests/test_check_example.py -v
  2. Cluster execution: Deploy the check and run it with --sink=do_nothing to verify it works against real system commands. Check the log files for execution details.
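When looking for those log files, note that the `init_logger` call in step 4 receives `log_dir=os.path.join(log_folder, type + "_logs")` and `log_name=node + ".log"`. Assuming `init_logger` writes to exactly that directory and file name (an assumption; it may add suffixes or rotate files), the expected location can be computed with a small hypothetical helper. The folder and node name in the usage line are made up.

```python
import os
import socket


def expected_log_path(log_folder: str, check_type: str, node: str = "") -> str:
    """Mirror the log_dir/log_name arguments used in the Click command."""
    # Default to the local hostname, as the command does via socket.gethostname().
    node = node or socket.gethostname()
    return os.path.join(log_folder, check_type + "_logs", node + ".log")


print(expected_log_path("/var/log/gcm", "prolog", "node001"))
```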

Quick reference

| What | Where |
| --- | --- |
| All checks | gcm/health_checks/checks/ |
| Common decorators | gcm/health_checks/click.py |
| Subprocess utilities | gcm/health_checks/subprocess.py |
| Exit codes & types | gcm/health_checks/types.py |
| Check name enum | gcm/schemas/health_check/health_check_name.py |
| Feature flags (killswitches) | gcm/monitoring/features/feature_definitions/health_checks_features.py |
| CLI entry point | gcm/health_checks/cli/health_checks.py |
| Check documentation | website/docs/GCM_Health_Checks/health_checks/ |
| Documentation template | check-example.md |
| Test fakes | gcm/tests/fakes.py |
| Output utilities | gcm/health_checks/check_utils/output_utils.py |
| Telemetry context | gcm/health_checks/check_utils/telem.py |
| Deep dive | Health Checks Deep Dive |
| Killswitch tests | gcm/tests/health_checks_tests/test_killswitches.py |