spdl.pipeline.Pipeline

class Pipeline

Data processing pipeline. Use PipelineBuilder to instantiate.

See also

Building and Running Pipeline explains the basic usage of PipelineBuilder and Pipeline.
⚠ Caveats ⚠ lists known anti-patterns that can cause a deadlock.
Pipeline Parallelism covers how to switch (or combine) multi-threading and multi-processing in detail.

Pipeline and PipelineBuilder facilitate building a data processing pipeline that consists of multiple stages of async operations. The concurrency of each stage can be configured independently.

Typically, the source is a lightweight (synchronous) iterable that yields the locations of the data, such as file paths and URLs. The first stage retrieves data from the (network) storage.

The subsequent stages process the data, such as decoding images and resizing them, or decoding audio and resampling them.

After preprocessing is done, the data are buffered in a sink, which is a queue.

The pipeline is executed in a background thread, so that the main thread can perform other tasks while the data are being processed.

The following diagram illustrates this.

flowchart TD
    Source["Source (Iterator)"]
    Queue
    subgraph Op1["Op1 (Concurrency = 4)"]
        op1_1(Task 1-1)
        op1_2(Task 1-2)
        op1_3(Task 1-3)
        op1_4(Task 1-4)
    end
    subgraph Op2["Op2 (Concurrency = 2)"]
        op2_1(Task 2-1)
        op2_2(Task 2-2)
    end
    Queue["Sink (Queue)"]
    Source --> Op1
    Op1 --> Op2
    Op2 --> Queue
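The structure in the diagram can be approximated with the standard library alone. The following is a minimal sketch, not SPDL's implementation: `run_pipeline` and `_EOF` are hypothetical names, each stage is a thread pool (4 workers for the first stage, 2 for the second), and the sink is a bounded queue drained by the main thread. `Executor.map` submits its inputs eagerly, so unlike a real pipeline this sketch does not throttle upstream stages when the sink is full.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_EOF = object()  # hypothetical sentinel marking the end of the stream


def run_pipeline(source, op1, op2, sink):
    """Run source -> op1 -> op2 -> sink; meant to be called in a background thread."""
    # Each stage has its own pool, so its concurrency is set independently.
    with ThreadPoolExecutor(max_workers=4) as stage1, \
            ThreadPoolExecutor(max_workers=2) as stage2:
        # Executor.map yields results in input order.
        for result in stage2.map(op2, stage1.map(op1, source)):
            sink.put(result)  # blocks while the sink is full
    sink.put(_EOF)


sink = queue.Queue(maxsize=3)
thread = threading.Thread(
    target=run_pipeline,
    args=(range(10), lambda x: x + 1, lambda x: x * 2, sink),
)
thread.start()

# The main thread is free to do other work; here it simply drains the sink.
results = []
while (item := sink.get()) is not _EOF:
    results.append(item)
thread.join()
```

Giving each stage its own pool is what lets a slow, I/O-bound stage run with more workers than a CPU-bound one.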

Example: Bulk loading images

import spdl.io
from spdl.pipeline import Pipeline, PipelineBuilder


def source():
    # Yield one image path per line, without the trailing newline.
    with open("images.txt") as f:
        for path in f:
            yield path.strip()


def load(path):
    # Assumes spdl.io.load_image is a regular (synchronous) function.
    return spdl.io.load_image(path)


pipeline: Pipeline = (
    PipelineBuilder()
    .add_source(source())
    .pipe(load, concurrency=10)
    .add_sink(3)
    .build(num_threads=10)
)

for item in pipeline.get_iterator(timeout=30):
    # do something with the decoded image
    ...

A Pipeline cleans up its background thread and worker processes automatically, so a forgotten pipeline does not hang the process at exit. It is stopped when the object is garbage collected and, as a safety net for a reference held until the program ends, by a hook that start() registers to run at interpreter shutdown. Even so, release the resources explicitly once you are done: call stop(), use the auto_stop() context manager, or drop all strong references to the Pipeline so that it is garbage collected. This frees the background thread, worker processes, and memory promptly instead of leaving them alive until exit.

Changed in version 0.4.0: Calling start() and stop() is now optional. When iterating a pipeline that has not been explicitly started, the background thread is started automatically on the first item request. When the Pipeline object is garbage collected, the background thread is stopped automatically via weakref.finalize. Explicit start() / stop() and the auto_stop() context manager continue to work as before.
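The garbage-collection path can be illustrated with a small standard-library sketch. This is not SPDL's code: `BackgroundTask` and `_shutdown` are hypothetical names, and the background thread only waits on an event. The point it shows is that the `weakref.finalize` callback and its arguments must not reference the owning object, otherwise the object would never be collected.

```python
import gc
import threading
import weakref


def _shutdown(stop_event, thread):
    """Signal the background thread to exit, then wait for it."""
    stop_event.set()
    thread.join()


class BackgroundTask:
    """Hypothetical object that owns a background thread."""

    def __init__(self):
        stop_event = threading.Event()
        # The thread target is bound to the event, not to `self`.
        thread = threading.Thread(target=stop_event.wait)
        thread.start()
        self.thread = thread
        # Runs `_shutdown` when `self` is garbage collected.
        self._finalizer = weakref.finalize(self, _shutdown, stop_event, thread)

    def stop(self):
        # A finalizer runs its callback at most once, so stop() is idempotent.
        self._finalizer()


task = BackgroundTask()
thread = task.thread
del task      # drop the last strong reference
gc.collect()  # not needed on CPython, but harmless elsewhere
```

After the last strong reference is dropped, the finalizer has stopped and joined the thread, so `thread.is_alive()` is false.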

Changed in version 0.6.0: Cleanup is more robust. start() now registers a hook (via threading._register_atexit) that stops a still-running pipeline at interpreter shutdown, so a reference held until the program ends no longer risks a hang at exit. Releasing resources explicitly is still recommended to free them promptly: call stop(), use auto_stop(), or drop all strong references so that the pipeline is garbage collected. See start() for details.

Methods

`auto_stop`(*[, timeout])	Context manager to start/stop the background thread automatically.
`get_item`(*[, timeout])	Get the next item.
`get_iterator`(*[, timeout])	Get an iterator, which iterates over the pipeline outputs.
`start`(*[, timeout])	Start the pipeline in background thread.
`stop`(*[, timeout])	Stop the pipeline.

__iter__() → Iterator[T]

Call get_iterator() without arguments.

auto_stop(*, timeout: float | None = None) → Iterator[None]

Context manager to start/stop the background thread automatically.

Parameters:

timeout – The duration to wait for the thread initialization / shutdown. [Unit: second] If None (default), it waits indefinitely.

get_item(*, timeout: float | None = None) → T

Get the next item.

Parameters:

timeout – The duration to wait for the next item to become available. [Unit: second] If None (default), it waits indefinitely.

Raises:

RuntimeError – The pipeline is not started.
TimeoutError – The pipeline does not produce the next item within the given time.
EOFError – The pipeline is exhausted or cancelled and there are no more items in the sink.
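The exceptions above suggest a simple consumption loop: keep calling get_item until EOFError signals exhaustion, and let TimeoutError propagate to the caller. The sketch below is illustrative only. `drain` is a hypothetical helper, and `FakePipeline` is a hypothetical stand-in that mimics just the documented EOFError behaviour so the example runs without SPDL.

```python
from collections import deque


class FakePipeline:
    """Hypothetical stand-in that mimics the documented get_item contract."""

    def __init__(self, items):
        self._items = deque(items)

    def get_item(self, *, timeout=None):
        # `timeout` is accepted for signature compatibility but unused here.
        if not self._items:
            raise EOFError("pipeline exhausted")
        return self._items.popleft()


def drain(pipeline, *, timeout=None):
    """Collect items until the pipeline raises EOFError.

    TimeoutError is deliberately not caught, so a stalled pipeline
    surfaces as an error instead of being mistaken for exhaustion.
    """
    items = []
    while True:
        try:
            items.append(pipeline.get_item(timeout=timeout))
        except EOFError:
            return items


collected = drain(FakePipeline(["a", "b", "c"]), timeout=1.0)
```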

get_iterator(*, timeout: float | None = None) → Iterator[T]

Get an iterator, which iterates over the pipeline outputs.

The returned iterator covers a single epoch (one pass over the source), regardless of whether the source is continuous (see the continuous argument of PipelineBuilder.add_source). Call this method again to iterate each subsequent epoch:

for epoch in range(num_epochs):
    for item in pipeline.get_iterator(timeout=...):
        ...

Parameters:

timeout – Timeout value used for each get_item call. [Unit: second]

Changed in version 0.6.0: Fixed iterator reuse with a continuous source. An iterator that had reached its epoch boundary used to resume into the next epoch when reused; it now stays exhausted, consistent with non-continuous sources. Use one iterator per epoch.

start(*, timeout: float | None = None, **kwargs: Any) → None

Start the pipeline in background thread.

Parameters:

timeout – Timeout value used when starting the thread and waiting for the pipeline to be initialized. [Unit: second]

Note

Calling start multiple times raises RuntimeError.

Note

Cleanup at interpreter exit. The pipeline runs a background, non-daemon event-loop thread (and may own worker subprocesses). They are released when you stop() the pipeline, or when the object is garbage collected – a weakref.finalize() stops it at GC. But if a strong reference is held until the program ends (e.g. a training loop that keeps the dataloader for the whole run), the object is not garbage collected before interpreter shutdown. Without further measures, shutdown would then hang while joining the still-running event-loop thread.

To prevent that, start() registers a process-wide hook that stops the pipeline at the very start of interpreter finalization. It holds only a weak reference, so it never keeps the pipeline alive; explicit stop() and the GC path are unaffected.

The hook uses threading._register_atexit and not atexit.register(), because of when CPython runs each. A plain atexit hook runs after non-daemon threads are already joined – too late. threading._register_atexit callbacks run earlier, inside threading._shutdown(), before that join (the same mechanism, and the same reason, that concurrent.futures uses):

flowchart TD
    subgraph C["run threading._register_atexit hooks (LIFO)"]
        S1["Pipeline's stop hook runs here: pipeline stopped"]
    end
    subgraph E["atexit hooks (LIFO)"]
        M1["multiprocessing joins children"]
        M2["weakref.finalize"]
    end
    A["Python reaches the end of program"] --> B["threading._shutdown()"] --> C --> D["join non-daemon threads"] --> E --> F["GC and module teardown"]

The stop hook runs before the non-daemon-thread join and before the atexit phase, so the pipeline is torn down before anything blocks on it.

stop(*, timeout: float | None = None) → None

Stop the pipeline.

Parameters:

timeout – Timeout value used when stopping the pipeline and waiting for the thread to join. [Unit: second]

Note

It is safe to call stop multiple times.