spdl.pipeline.ProcessGroupStatsMonitor¶
- class ProcessGroupStatsMonitor(callback: Callable[[ProcessGroupResourceUsage], Awaitable[None]], interval: float = 60.0, mp_context: BaseContext | None = None)[source]¶
Background task that spawns a subprocess to collect per-process-group stats.
In multi-rank training, multiple ranks on the same host often share a single cgroup. External monitoring systems (e.g., dynolog) report metrics per cgroup, so they cannot attribute resource usage to individual ranks. This monitor solves that by collecting stats per process group — each rank typically runs as its own PGID, giving per-rank granularity even when ranks share a cgroup.
Sums CPU, memory (RSS, PSS, private), disk IO, and network bytes across all processes sharing this rank’s PGID.
Runs in a separate subprocess so the main Python runtime is not affected by GC pauses or CPU overhead from /proc scanning.
- Parameters:
callback – An async callable that receives a
ProcessGroupResourceUsagesnapshot each interval.interval – Collection interval in seconds (default 60).
mp_context – A
multiprocessingcontext (e.g. frommultiprocessing.get_context("forkserver")) used to spawn the monitor subprocess.None(default) uses the default context.
Methods
run()Spawn a daemon subprocess to collect process-group stats and wait for it.
- async run() None[source]¶
Spawn a daemon subprocess to collect process-group stats and wait for it.
The subprocess runs its own asyncio event loop that periodically reads
/procto gather CPU, memory, disk IO, and network stats for all processes in this process group, then invokes the user-provided callback with aProcessGroupResourceUsagesnapshot.This method polls the subprocess every 5 seconds until it exits or this coroutine is cancelled. On cancellation the subprocess is terminated (then killed if it does not exit within 5 seconds) and
CancelledErroris re-raised.