Analyzing the Performance
=========================
.. currentmodule:: spdl.pipeline
In this section, we examine the performance statistics gathered from
a production training system.
.. note::
To setup the pipeline to collect the runtime performance statistics,
please refer to the
`Collecting the Runtime Statistics <../getting_started/logging.html>`_
section and :py:mod:`performance_analysis` example.
The pipeline is composed of four stages: download, preprocess, batch,
and transfer. The following code snippet and diagram illustrate this.
.. code-block:: python
pipeline = (
PipelineBuilder()
.add_source(Dataset())
.aggregate(8)
.pipe(download, concurrency=12)
.disaggregate()
.pipe(preprocess, concurrency=12)
.aggregate(4)
.pipe(collate)
.pipe(gpu_transfer)
.add_sink()
.build(num_threads=12)
)
.. raw:: html
.. mermaid::
flowchart TB
subgraph Download
direction TB
t11[Task 1]
t12[Task 2]
t13[...]
t1n[Task 12]
end
subgraph Preprocessing
direction TB
t21[Task 1]
t22[Task 2]
t23[...]
t2n[12]
end
Source
--> Aggregate1["Aggregate (8)"]
--> Download
--> Disaggregate
--> Preprocessing
--> Aggregate2["Aggregate(4)"]
--> Collate
--> transfer[GPU Transfer]
--> Sink
.. raw:: html
TaskPerfStats
-------------
The :py:class:`TaskPerfStats` class contains information about the stage functions.
The stats are aggregated (averaged or summed) over the (possibly concurrent) invocations
within the measurement window.
Average Time
~~~~~~~~~~~~
The :py:attr:`TaskPerfStats.ave_time` is the average time of successful function
invocations. Typically, download time is the largest, followed by decoding/preprocessing,
collation, and transfer.
When decoding is lightweight but the data size is large (such as processing long audio
in WAV format), collation and transfer can take longer than preprocessing.
In the following, we can see that the duration for downloading data changes over the course
of training. We also see occasional spikes.
The performance of downloading is affected by both external factors (i.e. the state of
networking) and internal factors (such as the size of data requested).
As we will see later, the speed of download is fast enough for this case, so the overall
change is not an issue here.
Other stages show the consistent performance throughout the training.
.. raw:: html
The number of tasks executed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :py:attr:`TaskPerfStats.num_tasks` is the number of function invocation completed,
including both successful and failure completion.
You can obtain the number of successful task counts, by subtracting
:py:attr:`TaskPerfStats.num_failures` from ``TaskPerfStats.num_tasks``.
In this case, there was no failure during the training, so we omit the plot for that.
.. raw:: html
QueuePerfStats
--------------
When a pipeline is built, queues are inserted between each stage of the pipeline.
The outputs from stage functions are put in their corresponding queues.
The inputs to the stage functions are retrieved from queues upstream to them.
The following figure illustrates this.
Using queues makes it easy to generalize the architecture of pipeline, as
it allows to orchestrate stage functions independently.
.. raw:: html
.. mermaid::
flowchart LR
Queue
subgraph S1[Stage 1]
direction LR
T11[Task 1]
T12[Task 2]
T13[Task 3]
T14[Task ...]
end
Queue[Queue 1]
subgraph S2[Stage 2]
direction LR
T21[Task 1]
T22[Task 2]
T23[Task ...]
end
Queue2[Queue 2]
T11 -- Put --> Queue
T12 -- Put --> Queue
T13 -- Put --> Queue
T14 -- Put --> Queue
Queue ~~~ T21 -- Get --> Queue
Queue ~~~ T22 -- Get --> Queue
Queue ~~~ T23 -- Get --> Queue
T21 -- Put --> Queue2
T22 -- Put --> Queue2
T23 -- Put --> Queue2
.. raw:: html
.. note::
- In the following plots, a legned ```` indicates that it is referring
to the performance of the queue attached at the exit of the ````.
- The ``sink`` queue is the last queue of the pipeline and it is consumed by
the foreground (the training loop).
Throughput
~~~~~~~~~~
Each queue has a fixed-size buffer. So the buffer also serves as
a regulator of back pressure.
For this reason, the throughput is roughly same across stages.
The :py:attr:`QueuePerfStats.qps` represents the number of items that
went through the queue (fetched by the downstream stage) per second.
Note that due to the operations ``aggregate`` and ``disaggregate``,
the QPS values from different stages may live in different value
range, but they should match if normalized.
.. raw:: html
Data Readiness
~~~~~~~~~~~~~~
The goal of performance analysis is to make it easy to find the
performance bottleneck. Due to the concurrency and elastic nature of
the pipeline, this task is not as straightforward as we want it to be.
The "Data Readiness" is our latest attempt in finding the bottleneck.
It measures the relative time that a queue has the next data available.
We define the data readiness as follow
.. code-block:: text
data_readiness := 1 - d_empty / d_measure
where ``d_measure`` is the duration of the measurement (or elapsed time),
``d_empty`` is the duration the queue was empty during the measurement.
.. raw:: html
The data readiness close to 100% means that the data is available whenever
the downstream stage (including the foreground training loop) tries to
fetch the next data, indicating that the data processing pipeline is
faster than the demand.
On the other hand, the data readiness close to 0% suggests that downstream
stages are starving. They fetch the next data as soon as it is available,
but the upstream is not able to fill it quick enough, indicating that the
data processing pipeline is not fast enough, and the system is
experiencing data starvation.
The following plot shows the data readiness of the same training run we observed.
From the source to up to the collate, the data readiness are close or above 99%.
But at GPU transfer, it drops to 85% and the sink readiness follows.
.. raw:: html
Note that the value of data readiness is meaningful only in the context of
stages before and after the queue. It is not valid to compare the readiness of
different queues.
The drop at preprocess means that the downstream stage (aggregate) is consuming
the results faster than the rate of production.
The recovery of the data readiness in collate means that the rate at which
the downstream stage (gpu_transfer) is consuming the result is not as fast
as the rate of production.
The data readiness of the sink slightly recovers.
This suggests that the rate at which the foreground training loop consume the
data are slightly slower than the rate of transfer.
Queue Get/Put Time
~~~~~~~~~~~~~~~~~~
Other stats related to data readiness are the time stage functions wait for the
completion of queue get and put operations.
If data is always available (i.e. data readiness is close to 100%), then stages
won't be blocked when fetching the next data to process.
Therefore, the wait time for getting the next data is minimal.
Conversely, if the upstream is not producing data fast enough, the
downstream stages will be blocked while fetching the next data.
.. raw:: html
The time for put operation exhibits opposite trend.
If the data readiness is close to 100%, queues are full, and stage functions
cannot put the next result, so they must wait until a spot becomes available.
As a result, the wait time increases.
The wait time increases with higher concurrency, as multiple function invocations
attempt to put results into the queue.
If the production rate is slow, downstream stages wait for the next items.
Consequently, items placed in the queue are fetched immediately,
resulting in a less occupied queue.
This leads to the situation where the put time becomes close to zero.
The following plot shows that download and collate stages wait longer because
it is faster than the rate of the consumption of their downstreams
(preprocess and gpu_transfer).
.. raw:: html
Summary
-------
SPDL Pipeline exposes performance statistics which help identifying the
performance bottleneck.
The most important metrics is the queue get time of sink stage, because the
wait time directly translates to the time the foreground training loop
waits for.
.. include:: ../plots/perf_analysis.txt