Analyzing the performance
=========================
.. currentmodule:: spdl.pipeline
In this section, we look at performance statistics gathered from a production
training system.
The pipeline is composed of four stages: download, preprocess, batch (collate),
and GPU transfer. The following code snippet and diagram illustrate this.
.. code-block:: python

    pipeline = (
        PipelineBuilder()
        .add_source(Dataset())
        .aggregate(8)
        .pipe(download, concurrency=12)
        .disaggregate()
        .pipe(preprocess, concurrency=12)
        .aggregate(4)
        .pipe(collate)
        .pipe(gpu_transfer)
        .add_sink()
        .build(num_threads=12)
    )
.. mermaid::

    flowchart TB
        subgraph Download
            direction TB
            t11[Task 1]
            t12[Task 2]
            t13[...]
            t1n[Task 12]
        end
        subgraph Preprocessing
            direction TB
            t21[Task 1]
            t22[Task 2]
            t23[...]
            t2n[Task 12]
        end
        Source --> Aggregate1["Aggregate (8)"] --> Download
        Download --> Disaggregate --> Preprocessing
        Preprocessing --> Aggregate2["Aggregate (4)"] --> Collate
        Collate --> transfer[GPU Transfer] --> Sink
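
The output of the sink is consumed by the foreground training loop.
The following is a minimal sketch of that consumption side, assuming the usual
pattern of iterating over the built :py:class:`Pipeline` inside ``auto_stop()``;
``training_step`` is a placeholder for the actual model update.

.. code-block:: python

    # Consume the pipeline output in the foreground (training) loop.
    with pipeline.auto_stop():
        for batch in pipeline:
            training_step(batch)  # placeholder for the actual training step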
TaskPerfStats
-------------
The :py:class:`TaskPerfStats` class contains information about the stage functions.
The stats are aggregated (averaged or summed) over the (possibly concurrent) invocations
within the measurement window.
Average Time
~~~~~~~~~~~~
The :py:attr:`TaskPerfStats.ave_time` is the average time of successful function invocations.
Typically, download takes the longest, decoding/preprocessing comes second, followed by
collation and transfer.
When decoding is lightweight but the data is large (such as when processing long
audio in WAV format), collation and transfer take longer than preprocessing.
In the following plot, we can see that the time to download data changes over the
course of training, with occasional spikes.
Download performance is affected by both external factors (e.g. the state of the
network) and internal factors (such as the size of the data requested).
As we will see later, downloading is fast enough in this case, so this variation
is not an issue here.
Other stages show consistent performance throughout the training.
The number of tasks executed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :py:attr:`TaskPerfStats.num_tasks` is the number of function invocations completed,
including both successes and failures.
You can obtain the number of successful invocations by subtracting
:py:attr:`TaskPerfStats.num_failures` from ``TaskPerfStats.num_tasks``.
In this case, there were no failures during the training, so we omit the corresponding plot.
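For illustration, assuming ``stats`` holds the :py:class:`TaskPerfStats` instance
reported for one stage over one measurement window (how it is obtained, e.g. via a
stats callback, is outside the scope of this section), the derived counts look like this:

.. code-block:: python

    # stats: TaskPerfStats for one stage over one measurement window (assumed given)
    num_success = stats.num_tasks - stats.num_failures
    failure_rate = stats.num_failures / stats.num_tasks if stats.num_tasks else 0.0
    print(f"success={num_success}, failure rate={failure_rate:.2%}, "
          f"average task time={stats.ave_time:.3f}s")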
QueuePerfStats
--------------
When a pipeline is built, queues are inserted between each stage of the pipeline.
The outputs from stage functions are put in their corresponding queues.
The inputs to the stage functions are retrieved from the queues upstream of them.
The following figure illustrates this.
Using queues makes it easy to generalize the architecture of the pipeline, as
it allows stage functions to be orchestrated independently.
.. mermaid::

    flowchart LR
        Queue
        subgraph S1[Stage 1]
            direction LR
            T11[Task 1]
            T12[Task 2]
            T13[Task 3]
            T14[Task ...]
        end
        Queue[Queue 1]
        subgraph S2[Stage 2]
            direction LR
            T21[Task 1]
            T22[Task 2]
            T23[Task ...]
        end
        Queue2[Queue 2]
        T11 -- Put --> Queue
        T12 -- Put --> Queue
        T13 -- Put --> Queue
        T14 -- Put --> Queue
        Queue ~~~ T21 -- Get --> Queue
        Queue ~~~ T22 -- Get --> Queue
        Queue ~~~ T23 -- Get --> Queue
        T21 -- Put --> Queue2
        T22 -- Put --> Queue2
        T23 -- Put --> Queue2
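
Conceptually, each stage is a set of concurrent tasks that get items from the
upstream queue and put results into the downstream queue. The following is a
minimal, framework-free sketch of this pattern using ``asyncio`` queues; it is
an illustration of the architecture, not SPDL's actual implementation.

.. code-block:: python

    import asyncio

    async def stage_worker(fn, upstream: asyncio.Queue, downstream: asyncio.Queue) -> None:
        # Each worker repeatedly gets an item from the upstream queue,
        # applies the (async) stage function, and puts the result downstream.
        while True:
            item = await upstream.get()
            result = await fn(item)
            await downstream.put(result)

    def spawn_stage(fn, upstream, downstream, concurrency):
        # A stage with concurrency=N is simply N workers sharing the same queues.
        # (Must be called from within a running event loop.)
        return [
            asyncio.create_task(stage_worker(fn, upstream, downstream))
            for _ in range(concurrency)
        ]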
.. note::

    - In the following plots, a legend entry bearing a stage name refers to
      the performance of the queue attached at the exit of that stage.
    - The ``sink`` queue is the last queue of the pipeline, and it is consumed by
      the foreground (the training loop).
Throughput
~~~~~~~~~~
Each queue has a fixed-size buffer, so the buffer also serves as
a regulator of back pressure.
Because a stage cannot run far ahead of its downstream once its buffer is full,
the throughput is roughly the same across stages.
The :py:attr:`QueuePerfStats.qps` represents the number of items that
went through the queue (fetched by the downstream stage) per second.
Note that due to the ``aggregate`` and ``disaggregate`` operations,
the QPS values of different stages may lie in different ranges,
but they should match once normalized by the aggregation factor.
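For example, with the pipeline above, one item downstream of ``aggregate(8)``
corresponds to 8 source items, and one item downstream of ``aggregate(4)``
corresponds to 4 preprocessed samples.
A rough sketch of the normalization (the QPS numbers are illustrative, not from
the actual training run):

.. code-block:: python

    # QPS reported for each stage's output queue (illustrative values).
    qps_download = 12.5     # items are aggregates of 8 source items
    qps_preprocess = 100.0  # items are individual samples (after disaggregate)
    qps_collate = 25.0      # items are batches of 4 samples

    # Normalize everything to "source items per second".
    normalized = {
        "download": qps_download * 8,
        "preprocess": qps_preprocess,
        "collate": qps_collate * 4,
    }
    # In a healthy pipeline these normalized values are roughly equal (here, 100).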
Data Readiness
~~~~~~~~~~~~~~
The goal of performance analysis is to make it easy to find the
performance bottleneck. Due to the concurrent and elastic nature of
the pipeline, this task is not as straightforward as we would like it to be.
"Data readiness" is our latest attempt at finding the bottleneck.
It measures the relative time that a queue has the next data available.
We define data readiness as follows:
.. code-block:: text

    data_readiness := 1 - d_empty / d_measure
where ``d_measure`` is the duration of the measurement window (or elapsed time),
and ``d_empty`` is the duration for which the queue was empty during that window.
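As a concrete (illustrative) calculation: if the queue was empty for a total of
30 seconds within a 600-second measurement window, the data readiness is
``1 - 30 / 600 = 0.95``, i.e. 95%.

.. code-block:: python

    def data_readiness(d_empty: float, d_measure: float) -> float:
        """Fraction of the measurement window during which the queue had data."""
        return 1.0 - d_empty / d_measure

    readiness = data_readiness(d_empty=30.0, d_measure=600.0)  # 0.95, i.e. 95%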
A data readiness close to 100% intuitively means that whenever
the downstream stage (including the foreground training loop) tries to
fetch the next data, the data is available, meaning that the data processing
pipeline is faster than the demand.
On the other hand, a data readiness close to 0% suggests that the downstream
stages are starving. They fetch the next data as soon as it is available,
but the upstream cannot refill the queue quickly enough, indicating that the
data processing pipeline is not fast enough, and the system is
experiencing data starvation.
The following plot shows the data readiness of the same training run we saw above.
From the source up to the collate stage, the data readiness is close to or above 99%.
But at GPU transfer, it drops to 85%, and the sink readiness follows.
Note that the value of data readiness is meaningful only in the context of
stages before and after the queue. It is not valid to compare the readiness of
different queues.
The drop at preprocess means that the downstream stage (aggregate) is consuming
the results faster than the rate of production.
The recovery of the data readiness in collate means that the rate at which
the downstream stage (gpu_transfer) is consuming the result is not as fast
as the rate of production.
The data readiness of the sink recovers slightly.
This suggests that the rate at which the foreground training loop consumes the
data is slightly slower than the rate of transfer.
Queue Get/Put Time
~~~~~~~~~~~~~~~~~~
Other stats related to data readiness are the times the stage functions spend
waiting for queue get and put operations to complete.
If data is always available (i.e. data readiness is close to 100%), then stages
are not blocked on fetching the next data to process,
so the time spent waiting to get the next data is small.
On the other hand, if the upstream is not producing data fast enough, then the
downstream stages are blocked on fetching the next data.
The time to put shows the opposite trend.
If the data readiness is close to 100%, the queues are full, and stage functions
cannot put their next result until a spot becomes available, so the put wait
time gets longer. The wait time grows further when higher concurrency is applied,
as multiple concurrent invocations try to put results into the same queue.
If the rate of production is slow, the downstream stages are waiting for the next
items, so items put into the queue are fetched immediately and the queue stays
mostly empty. This leads to put times close to zero.
The following plot shows that the download and collate stages wait longer on put,
because they produce results faster than their downstream stages
(preprocess and gpu_transfer) consume them.
Summary
-------
SPDL Pipeline exposes performance statistics that help identify the
performance bottleneck.
The most important metric is the queue get time of the sink stage, because this
wait time directly translates to the time the foreground training loop
spends waiting for data.
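A simple way to cross-check this from the training side is to measure the fraction
of wall-clock time the foreground loop spends waiting for the next batch versus the
time it spends in the training step itself.
The sketch below assumes the consumption pattern shown earlier; ``training_step``
is a placeholder.

.. code-block:: python

    import time

    wait_time = 0.0
    step_time = 0.0
    with pipeline.auto_stop():
        it = iter(pipeline)
        while True:
            t0 = time.monotonic()
            try:
                batch = next(it)  # blocks on the sink queue (data wait)
            except StopIteration:
                break
            t1 = time.monotonic()
            training_step(batch)  # placeholder for the actual model update
            t2 = time.monotonic()
            wait_time += t1 - t0
            step_time += t2 - t1

    total = wait_time + step_time
    if total > 0:
        print(f"Fraction of time waiting for data: {wait_time / total:.1%}")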
.. include:: ../plots/perf_analysis.txt