spdl.pipeline.ProcessGroupStatsMonitor

class ProcessGroupStatsMonitor(callback: Callable[[ProcessGroupResourceUsage], Awaitable[None]], interval: float = 60.0, mp_context: BaseContext | None = None)[source]

Background task that spawns a subprocess to collect per-process-group stats.

In multi-rank training, multiple ranks on the same host often share a single cgroup. External monitoring systems (e.g., dynolog) report metrics per cgroup, so they cannot attribute resource usage to individual ranks. This monitor solves that by collecting stats per process group — each rank typically runs as its own PGID, giving per-rank granularity even when ranks share a cgroup.

Sums CPU, memory (RSS, PSS, private), disk IO, and network bytes across all processes sharing this rank’s PGID.

Runs in a separate subprocess so the main Python runtime is not affected by GC pauses or CPU overhead from /proc scanning.

Parameters:
  • callback – An async callable that receives a ProcessGroupResourceUsage snapshot each interval.

  • interval – Collection interval in seconds (default 60).

  • mp_context – A multiprocessing context (e.g. from multiprocessing.get_context("forkserver")) used to spawn the monitor subprocess. None (default) uses the default context.

Methods

run()

Spawn a daemon subprocess to collect process-group stats and wait for it.

async run() None[source]

Spawn a daemon subprocess to collect process-group stats and wait for it.

The subprocess runs its own asyncio event loop that periodically reads /proc to gather CPU, memory, disk IO, and network stats for all processes in this process group, then invokes the user-provided callback with a ProcessGroupResourceUsage snapshot.

This method polls the subprocess every 5 seconds until it exits or this coroutine is cancelled. On cancellation the subprocess is terminated (then killed if it does not exit within 5 seconds) and CancelledError is re-raised.