Paradigm Shift

Concurrency Structure

When using SPDL, it is important to understand how SPDL structures concurrency differently from common process-based data loaders.

In process-based data loading, each process runs the entire pipeline. The pipeline is implemented as a Dataset class.

flowchart
    subgraph P1[Process 3]
        direction TB
        S10[src] --> S11[Stage 1] --> S12[Stage 2]
    end
    subgraph P2[Process 2]
        direction TB
        S20[src] --> S21[Stage 1] --> S22[Stage 2]
    end
    subgraph P3[Process 1]
        direction TB
        S30[src] --> S31[Stage 1] --> S32[Stage 2]
    end
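The process-based structure can be sketched in plain Python as follows. This is only an illustration: `ToyDataset`, `stage1`, `stage2`, and `worker_loop` are hypothetical names, the class merely mimics the shape of a map-style Dataset, and no real processes are spawned. `worker_loop` stands in for what a single worker process would execute.

```python
from typing import List


def stage1(x: int) -> int:
    # Placeholder for the first stage (e.g. decoding); hypothetical.
    return x + 1


def stage2(x: int) -> int:
    # Placeholder for the second stage (e.g. preprocessing); hypothetical.
    return x * 10


class ToyDataset:
    """Hypothetical map-style dataset: the whole pipeline lives in __getitem__."""

    def __init__(self, size: int) -> None:
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> int:
        # src -> Stage 1 -> Stage 2, all executed by whichever worker calls this.
        return stage2(stage1(index))


def worker_loop(dataset: ToyDataset, worker_id: int, num_workers: int) -> List[int]:
    """What one worker process would run: the full pipeline on its own shard."""
    return [dataset[i] for i in range(worker_id, len(dataset), num_workers)]
```

Every worker repeats the same `src -> Stage 1 -> Stage 2` chain, so the only concurrency knob in this layout is the number of workers.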

In contrast, SPDL parallelizes the pipeline stage by stage, and each stage can run with a different level of concurrency. This approach is a better fit for achieving high throughput.

flowchart
    subgraph P1[Stage 1]
        direction TB
        T11[Task 1]
        T12[Task 2]
        T13[Task 3]
    end
    subgraph P2[Stage 2]
        direction TB
        T21[Task 1]
        T22[Task 2]
    end
    subgraph P3[Stage 3]
        direction TB
        T31[Task 1]
    end
    src --> P1 --> P2 --> P3
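To make the stage-by-stage structure concrete, here is a minimal sketch built only on the standard library's `asyncio`. It is not SPDL's API or implementation. `run_stage`, `run_pipeline`, and the three stage functions are hypothetical names chosen for illustration. Each stage gets its own number of worker tasks (3, 2, and 1, matching the diagram), and stages are connected by bounded queues.

```python
import asyncio

_EOF = object()  # Sentinel marking the end of the stream.


async def run_stage(fn, in_q, out_q, concurrency):
    """Apply async `fn` to items from `in_q` using `concurrency` worker tasks."""

    async def worker():
        while True:
            item = await in_q.get()
            if item is _EOF:
                # Put the sentinel back so sibling workers also stop.
                await in_q.put(_EOF)
                return
            await out_q.put(await fn(item))

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    await out_q.put(_EOF)  # Tell the next stage that this one is done.


async def run_pipeline(src, stages):
    """Run `src` through `stages`, a list of (async function, concurrency)."""
    queues = [asyncio.Queue(maxsize=4) for _ in range(len(stages) + 1)]

    async def feed():
        for item in src:
            await queues[0].put(item)
        await queues[0].put(_EOF)

    async def drain():
        results = []
        while (item := await queues[-1].get()) is not _EOF:
            results.append(item)
        return results

    stage_tasks = [
        run_stage(fn, queues[i], queues[i + 1], concurrency)
        for i, (fn, concurrency) in enumerate(stages)
    ]
    *_, results = await asyncio.gather(feed(), *stage_tasks, drain())
    return results


async def stage1(x):
    await asyncio.sleep(0.01)  # Stand-in for slow I/O.
    return x + 1


async def stage2(x):
    await asyncio.sleep(0.005)  # Stand-in for moderately slow work.
    return x * 10


async def stage3(x):
    return x - 1
```

A call such as `asyncio.run(run_pipeline(range(6), [(stage1, 3), (stage2, 2), (stage3, 1)]))` gives the slowest stage the most concurrency. Because items within a stage finish independently, this sketch does not preserve the input order.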

It is worth noting that in this setup, there is no equivalent of the Dataset class.

This paradigm shift makes it difficult to migrate to SPDL mechanically, for example with a one-line change or by swapping one class for another.