Paradigm Shift
Concurrency Structure
When using SPDL, it is important to understand how SPDL structures concurrency differently from common process-based data loaders.
In process-based data loading, each process runs the entire pipeline. The pipeline is implemented as a Dataset class.
(Figure: three processes, Process 1, Process 2, and Process 3, each running its own copy of the full pipeline: src, Stage 1, Stage 2.)
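The process-based layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ToyDataset`, `load`, `stage1`, `stage2`, and `worker` are hypothetical names, the stage bodies are placeholders, and the workers are called one after another so that the sketch stays runnable anywhere. In a real process-based loader, each `worker` call would run in its own process.

```python
def load(index):
    """Hypothetical source: fetch the raw item for an index."""
    return index


def stage1(x):
    """Hypothetical first stage (placeholder arithmetic)."""
    return x + 1


def stage2(x):
    """Hypothetical second stage (placeholder arithmetic)."""
    return x * 2


class ToyDataset:
    """Hypothetical Dataset-style class: one lookup runs the whole pipeline."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        # Every stage runs inside the same call, hence inside the same worker.
        return stage2(stage1(load(index)))


def worker(dataset, worker_id, num_workers):
    """What one worker process would do: run all stages for its share of indices."""
    return [dataset[i] for i in range(worker_id, len(dataset), num_workers)]


dataset = ToyDataset(6)
# Stand-in for three worker processes; each one executes the full pipeline.
shards = [worker(dataset, w, 3) for w in range(3)]
print(shards)
```

The only knob in this layout is the number of workers. Every stage is scaled by the same factor, because the stages are not visible outside `__getitem__`.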
In contrast, SPDL parallelizes the pipeline stage by stage, and each stage can use a different concurrency. This approach is better suited to achieving higher throughput.
(Figure: a single pipeline in which src feeds Stage 1 (Task 1, Task 2, Task 3), then Stage 2 (Task 1, Task 2), then Stage 3 (Task 1).)
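To make the stage-by-stage layout concrete, here is a minimal sketch that uses only the Python standard library. It is not SPDL's API or implementation. `run_stage`, `run_pipeline`, and the stage functions are hypothetical names, and threads with queues are an assumption chosen to keep the example self-contained. The point is only that each stage gets its own concurrency, mirroring the figure: three tasks for Stage 1, two for Stage 2, one for Stage 3.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def run_stage(fn, concurrency, inbox, outbox):
    """Start `concurrency` worker threads that apply `fn` to items from `inbox`."""

    def worker():
        while True:
            item = inbox.get()
            if item is _END:
                inbox.put(_END)  # leave the sentinel for sibling workers
                return
            outbox.put(fn(item))

    workers = [threading.Thread(target=worker) for _ in range(concurrency)]
    for w in workers:
        w.start()

    def close():
        for w in workers:
            w.join()
        outbox.put(_END)  # tell the next stage that this stage has drained

    closer = threading.Thread(target=close)
    closer.start()
    return closer


def run_pipeline(src, stages):
    """Run `src` through `stages`, a list of (function, concurrency) pairs."""
    first = current = queue.Queue()
    closers = []
    for fn, concurrency in stages:
        nxt = queue.Queue()
        closers.append(run_stage(fn, concurrency, current, nxt))
        current = nxt
    for item in src:
        first.put(item)
    first.put(_END)
    results = []
    while (item := current.get()) is not _END:
        results.append(item)
    for closer in closers:
        closer.join()
    return results


def stage1(x):
    return x + 1


def stage2(x):
    return x * 2


def stage3(x):
    return x - 1


# Concurrency is chosen per stage, not per pipeline.
results = run_pipeline(range(5), [(stage1, 3), (stage2, 2), (stage3, 1)])
print(sorted(results))
```

Items may leave the pipeline in a different order than they entered, so the example sorts the results before printing. Note that nothing in this sketch plays the role of a Dataset class: the pipeline is described as a source plus a list of stages.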
It is worth noting that in this setup, there is no equivalent of the Dataset class.
This paradigm shift makes it difficult to migrate to SPDL mechanically, for example with a one-line change or by swapping one class for another.