Shared-Memory Arena

High CPU utilization starves the GPU — the noisy neighbour effect — so an efficient training pipeline keeps total CPU utilization low.

Running the data pipeline in a subprocess (run_pipeline_in_subprocess()) isolates the data-loading CPU work from the training process, which already helps. But it introduces a new cost: every item produced in the subprocess must cross the process boundary to reach the training process. By default this is done by pickling the item and copying the bytes through a multiprocessing queue — work that is paid on both sides and that grows with the payload size. For pipelines that move large payloads per item (NumPy arrays, Torch tensors, raw bytes, or spdl.io.VideoPackets), this transfer is itself a meaningful source of host CPU usage — the very thing we are trying to keep low.

The shared-memory arena removes most of that cost.

How it works

An arena is a pre-allocated region of shared memory that both processes map. Instead of pickling a large binary into the IPC queue, the producer writes it into the arena and sends only a small reference; the consumer reads it back out of shared memory. The bulk bytes never travel through the pickle/queue path, so the per-item transfer CPU drops sharply.

Two backends are available, both passed via the arena argument of run_pipeline_in_subprocess():

  • SharedMemorySegmentPool — the consumer restores a zero-copy view that points directly at the shared memory, so it does essentially no per-item work. The pool blocks the producer when all segments are in use, so a slow consumer simply throttles the producer rather than overflowing. This is the recommended backend.

  • SharedMemoryRingBuffer — the consumer copies each payload out of the ring into a private buffer (so it never hands out a live view into shared memory). This is still far cheaper than pickling, but the ring is non-blocking: if it is too small for the number of in-flight items it raises shared-memory arena full. Size its capacity for the in-flight high-water mark (roughly (buffer_size + 2) × max_item_bytes).

from spdl.pipeline import SharedMemorySegmentPool, run_pipeline_in_subprocess

# segment_size: the largest single item; count: a few in-flight units.
arena = SharedMemorySegmentPool(segment_size=64 << 20, count=8)

source = run_pipeline_in_subprocess(
    backend.get_config(), num_threads=num_threads, arena=arena
)

Effect on CPU, throughput, and memory

The benchmark_arena_transport example ships a fixed dataset from a subprocess to the main process over the three transports — plain IPC (no arena), the ring buffer, and the segment pool — and measures, per configuration, the receive throughput, the CPU time spent, and the peak resident memory.

The figure below summarizes a run on a CPU-only host (one row per payload type; columns are throughput, CPU, and peak memory; bars compare the three transports):

../_images/example_benchmark_arena_transport.png

The pattern is consistent across payload types:

  • CPU — the segment pool spends close to zero CPU moving a payload, while plain IPC spends several CPU-seconds per configuration at 32 MiB pickling and copying. This is the property that matters most for training: it returns host CPU to the budget.

  • Throughput — the arena also moves data faster, and the gap widens with payload size (largest for bytes and NumPy arrays).

  • Peak memory — for bytes / NumPy / packets the transient pickle buffers of plain IPC are the heaviest; the pool’s fixed shared region is the lightest.

Torch tensors are a special case: torch’s own multiprocessing reducer already moves a tensor’s storage through shared memory, so plain IPC does not pay a bulk copy for them — the arena’s CPU win is therefore smallest for tensors (though the pool still drops their transfer CPU to ~0).

Relation to the noisy neighbour

The arena does not make the data more available — that is what buffering and concurrency tuning are for. What it does is make the transfer cheaper in CPU. Because the host CPU is the shared resource that launches GPU kernels, every CPU-second the arena saves on IPC is a CPU-second available for timely kernel launches. In other words, the arena attacks the noisy neighbour from the supply side: it lets a subprocess pipeline move large payloads while consuming far less of the CPU budget that the training loop depends on.

When to use it

  • Use an arena when an MTP pipeline moves large payloads per item (≳ 1 MiB): decoded tensors, large arrays, raw bytes, or demuxed packets.

  • Prefer SharedMemorySegmentPool (zero-copy, blocking). Use SharedMemoryRingBuffer when a copy-out is acceptable, and size it for the in-flight high-water mark.

  • For small payloads the transfer CPU is already negligible, so the arena adds complexity without benefit — skip it.