spdl.pipeline.ProcessGroupResourceUsage

class ProcessGroupResourceUsage(pid: int, pgid: int, cpu_percent: float | None = None, rss_bytes: int | None = None, pss_bytes: int | None = None, private_bytes: int | None = None, disk_read_bytes: int | None = None, disk_write_bytes: int | None = None, num_procs: int | None = None, net_rx_bytes: int | None = None, net_tx_bytes: int | None = None)

Snapshot of resource usage across all processes in the process group.

Collected periodically by ProcessGroupStatsMonitor and passed to the user-provided callback.
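As a rough illustration of what a user-provided callback might do with a snapshot, the sketch below formats one log line and tolerates the optional fields being None. The names `format_usage` and `on_usage` are hypothetical, and the `SimpleNamespace` object is only a stand-in carrying the same attribute names as the class signature; how the callback is registered with ProcessGroupStatsMonitor is not shown here.

```python
from types import SimpleNamespace


def format_usage(usage) -> str:
    """Render one snapshot as a single log line, tolerating None fields."""

    def mib(n):
        # Optional byte counts are None when the underlying source is unavailable.
        return "n/a" if n is None else f"{n / 2**20:.1f} MiB"

    cpu = "n/a" if usage.cpu_percent is None else f"{usage.cpu_percent:.1f}%"
    return (
        f"pgid={usage.pgid} procs={usage.num_procs} cpu={cpu} "
        f"rss={mib(usage.rss_bytes)} pss={mib(usage.pss_bytes)} "
        f"private={mib(usage.private_bytes)}"
    )


def on_usage(usage) -> None:
    # Hypothetical callback body: print the formatted snapshot.
    print(format_usage(usage))


# Stand-in for a first snapshot (illustration only): cpu_percent is None
# because there is no previous sample, and smaps_rollup data is missing.
first = SimpleNamespace(
    pid=100, pgid=100, cpu_percent=None, rss_bytes=512 * 2**20,
    pss_bytes=None, private_bytes=None, num_procs=4,
)
line = format_usage(first)
```

Checking each optional field for None before formatting keeps the callback from failing on the first snapshot or on systems where smaps_rollup cannot be read.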

Memory metrics — three complementary views are provided:

  • RSS (Resident Set Size, from /proc/[pid]/stat): Total physical pages mapped by each process. Shared pages (shared libraries, CUDA context, etc.) are counted once per process that maps them, so summing RSS across a process group overcounts actual physical memory when pages are shared.

  • PSS (Proportional Set Size, from /proc/[pid]/smaps_rollup): Each shared page is divided equally among all processes that map it. Summing PSS across a process group gives the most accurate estimate of actual physical memory consumption without double-counting.

  • Private bytes (Private_Clean + Private_Dirty from /proc/[pid]/smaps_rollup): Only pages exclusive to each process — memory that would be freed if the process exited. This undercounts total usage because it excludes shared memory entirely, but isolates per-process allocations (model weights, activations, buffers).

The difference RSS − Private approximates each process’s shared memory contribution. Reading smaps_rollup is more expensive than stat (the kernel walks page tables), but it is a single-file read per process so the overhead is modest.
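To make the smaps_rollup-based metrics concrete, here is a minimal sketch of extracting PSS and private bytes from the text of one /proc/[pid]/smaps_rollup file. The function name `parse_smaps_rollup` and the sample contents are assumptions for illustration, not the library's implementation. The kernel reports these fields in kB, so the sketch converts them to bytes.

```python
def parse_smaps_rollup(text: str) -> dict[str, int]:
    """Return PSS and private bytes parsed from smaps_rollup text."""
    fields_kb: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Data lines look like "Pss:    1234 kB"; the address-range
        # header line does not match this shape and is skipped.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields_kb[key.strip()] = int(parts[0])
    # Raises KeyError if a field is missing; a caller could map that to None.
    private_kb = fields_kb["Private_Clean"] + fields_kb["Private_Dirty"]
    return {
        "pss_bytes": fields_kb["Pss"] * 1024,
        "private_bytes": private_kb * 1024,
    }


# Made-up sample: 512 kB private plus 1536 kB shared among three processes.
SAMPLE = """\
00400000-7ffc0000 ---p 00000000 00:00 0                          [rollup]
Rss:                2048 kB
Pss:                1024 kB
Shared_Clean:       1536 kB
Shared_Dirty:          0 kB
Private_Clean:       128 kB
Private_Dirty:       384 kB
"""
result = parse_smaps_rollup(SAMPLE)
```

Summing these per-process values over every process in the group would give group-level totals comparable to pss_bytes and private_bytes.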

Which metric to use:

  • Use PSS as the primary metric for comparing memory across configuration changes — it reflects the true physical memory cost of the process group without double-counting.

  • Use Private to isolate per-process allocations (model weights, activations, buffers) from shared overhead.

  • Use RSS as an upper-bound sanity check. When num_procs == 1 and no pages are shared with processes outside the group, RSS equals PSS.

  • PSS − Private can be derived in queries to see how much shared memory is attributed to this group.
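The relationship between the three memory views can be checked with a small worked example. The sketch below uses a hypothetical helper, `shared_breakdown`, and made-up numbers: four processes, each with 100 MiB of private memory, all mapping the same 200 MiB of shared pages, with no sharers outside the group.

```python
from types import SimpleNamespace


def shared_breakdown(usage):
    """Return (RSS - Private, PSS - Private) in bytes, or None if unavailable."""
    if None in (usage.rss_bytes, usage.pss_bytes, usage.private_bytes):
        return None
    return (
        usage.rss_bytes - usage.private_bytes,
        usage.pss_bytes - usage.private_bytes,
    )


MIB = 2**20
n, private_each, shared = 4, 100 * MIB, 200 * MIB

# Stand-in snapshot with the group-level totals each metric would report.
usage = SimpleNamespace(
    rss_bytes=n * (private_each + shared),  # shared pages counted per process
    pss_bytes=n * private_each + shared,    # shared pages counted once overall
    private_bytes=n * private_each,         # shared pages excluded
)
rss_gap, pss_gap = shared_breakdown(usage)
```

Here RSS sums to 1200 MiB, PSS to 600 MiB, and Private to 400 MiB. RSS − Private is 800 MiB (the shared pages counted four times), while PSS − Private is 200 MiB, which matches the shared memory actually in use.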

Attributes

  • cpu_percent: CPU utilization as a percentage of a single core over the last interval.

  • disk_read_bytes: Total bytes read from storage across the process group.

  • disk_write_bytes: Total bytes written to storage across the process group.

  • net_rx_bytes: Total network bytes received (excluding loopback).

  • net_tx_bytes: Total network bytes transmitted (excluding loopback).

  • num_procs: Number of processes in the process group.

  • private_bytes: Total private memory (Private_Clean + Private_Dirty) in bytes.

  • pss_bytes: Total proportional set size in bytes across the process group.

  • rss_bytes: Total resident set size in bytes across the process group.

  • pid: PID of the monitoring process.

  • pgid: Process group ID being monitored.

cpu_percent: float | None = None

CPU utilization as a percentage of a single core over the last interval.

Computed as delta_cpu_usec / delta_wall_usec * 100. A value of 200.0 means two cores were fully utilized. None on the first snapshot (no previous value to diff against).
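The formula above can be sketched as a small function. This is an illustration under stated assumptions rather than the library's code: the function name and the `(cpu_usec, wall_usec)` sample tuples are hypothetical, and returning None for a non-positive wall-clock delta is an added guard that the source does not describe.

```python
def cpu_percent(prev, curr):
    """Compute CPU percent from two (cpu_usec, wall_usec) samples.

    prev is None on the first snapshot, in which case None is returned.
    """
    if prev is None:
        return None
    delta_cpu_usec = curr[0] - prev[0]
    delta_wall_usec = curr[1] - prev[1]
    if delta_wall_usec <= 0:
        # Assumed guard: avoid dividing by zero if no wall time has elapsed.
        return None
    return delta_cpu_usec / delta_wall_usec * 100


# 2 s of CPU time consumed during 1 s of wall time: two cores fully busy.
first = cpu_percent(None, (1_000_000, 10_000_000))
second = cpu_percent((1_000_000, 10_000_000), (3_000_000, 11_000_000))
```

Because the value is relative to a single core, it can exceed 100 whenever the process group keeps more than one core busy during the interval.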

disk_read_bytes: int | None = None

Total bytes read from storage across the process group.

disk_write_bytes: int | None = None

Total bytes written to storage across the process group.

net_rx_bytes: int | None = None

Total network bytes received (excluding loopback).

net_tx_bytes: int | None = None

Total network bytes transmitted (excluding loopback).

num_procs: int | None = None

Number of processes in the process group.

pgid: int

Process group ID being monitored.

pid: int

PID of the monitoring process.

private_bytes: int | None = None

Total private memory (Private_Clean + Private_Dirty) in bytes.

Only pages exclusive to each process. None when smaps_rollup is unavailable.

pss_bytes: int | None = None

Total proportional set size in bytes across the process group.

Shared pages are divided by the number of sharers, giving the most accurate view of actual physical memory cost. None when smaps_rollup is unavailable.

rss_bytes: int | None = None

Total resident set size in bytes across the process group.

Overcounts physical memory when pages are shared across processes. See class docstring for details.