Analyzing the performance
=========================
.. currentmodule:: spdl.pipeline
In this section, we look at performance statistics gathered from a production
training system.
The pipeline is composed of four stages: download, preprocess, batch (collate),
and GPU transfer. The following code snippet and diagram illustrate this.
.. code-block:: python

    pipeline = (
        PipelineBuilder()
        .add_source(Dataset())
        .aggregate(8)
        .pipe(download, concurrency=12)
        .disaggregate()
        .pipe(preprocess, concurrency=12)
        .aggregate(4)
        .pipe(collate)
        .pipe(gpu_transfer)
        .add_sink()
        .build(num_threads=12)
    )
.. mermaid::

    flowchart TB
        subgraph Download
            direction TB
            t11[Task 1]
            t12[Task 2]
            t13[...]
            t1n[Task 12]
        end
        subgraph Preprocessing
            direction TB
            t21[Task 1]
            t22[Task 2]
            t23[...]
            t2n[Task 12]
        end
        Source --> Aggregate1["Aggregate (8)"] --> Download
        Download --> Disaggregate --> Preprocessing
        Preprocessing --> Aggregate2["Aggregate (4)"] --> Collate
        Collate --> transfer[GPU Transfer] --> Sink
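
The output of the sink is consumed by the foreground training loop.
The following is a minimal sketch of that consumption side, assuming the usual
pattern of iterating over the built :py:class:`Pipeline` inside ``auto_stop()``;
``training_step`` is a placeholder for the actual model update.

.. code-block:: python

    # Consume the pipeline output in the foreground (training) loop.
    with pipeline.auto_stop():
        for batch in pipeline:
            training_step(batch)  # placeholder for the actual training step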
TaskPerfStats
-------------
The :py:class:`TaskPerfStats` class contains information about the stage functions.
The stats are aggregated (averaged or summed) over the (possibly concurrent) invocations
within the measurement window.
Average Time
~~~~~~~~~~~~
The :py:attr:`TaskPerfStats.ave_time` is the average time of successful function invocations.
Typically, download takes the longest, decoding/preprocessing comes second, followed by
collation and transfer.
When decoding is lightweight but the data is large (such as when processing long
audio in WAV format), collation and transfer take longer than preprocessing.
In the following plot, we can see that the time to download data changes over the
course of training, with occasional spikes.
Download performance is affected by both external factors (e.g. the state of the
network) and internal factors (such as the size of the data requested).
As we will see later, downloading is fast enough in this case, so this variation
is not an issue here.
Other stages show consistent performance throughout the training.
The number of tasks executed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :py:attr:`TaskPerfStats.num_tasks` is the number of function invocations completed,
including both successes and failures.
You can obtain the number of successful invocations by subtracting
:py:attr:`TaskPerfStats.num_failures` from ``TaskPerfStats.num_tasks``.
In this case, there were no failures during the training, so we omit the corresponding plot.
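For illustration, assuming ``stats`` holds the :py:class:`TaskPerfStats` instance
reported for one stage over one measurement window (how it is obtained, e.g. via a
stats callback, is outside the scope of this section), the derived counts look like this:

.. code-block:: python

    # stats: TaskPerfStats for one stage over one measurement window (assumed given)
    num_success = stats.num_tasks - stats.num_failures
    failure_rate = stats.num_failures / stats.num_tasks if stats.num_tasks else 0.0
    print(f"success={num_success}, failure rate={failure_rate:.2%}, "
          f"average task time={stats.ave_time:.3f}s")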
QueuePerfStats
--------------
When a pipeline is built, queues are inserted between each stage of the pipeline.
The outputs from stage functions are put in their corresponding queues.
The inputs to the stage functions are retrieved from the queues upstream of them.
The following figure illustrates this.
Using queues makes it easy to generalize the architecture of the pipeline, as
it allows stage functions to be orchestrated independently.
.. mermaid::

    flowchart LR
        Queue
        subgraph S1[Stage 1]
            direction LR
            T11[Task 1]
            T12[Task 2]
            T13[Task 3]
            T14[Task ...]
        end
        Queue[Queue 1]
        subgraph S2[Stage 2]
            direction LR
            T21[Task 1]
            T22[Task 2]
            T23[Task ...]
        end
        Queue2[Queue 2]
        T11 -- Put --> Queue
        T12 -- Put --> Queue
        T13 -- Put --> Queue
        T14 -- Put --> Queue
        Queue ~~~ T21 -- Get --> Queue
        Queue ~~~ T22 -- Get --> Queue
        Queue ~~~ T23 -- Get --> Queue
        T21 -- Put --> Queue2
        T22 -- Put --> Queue2
        T23 -- Put --> Queue2
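
Conceptually, each stage is a set of concurrent tasks that get items from the
upstream queue and put results into the downstream queue. The following is a
minimal, framework-free sketch of this pattern using ``asyncio`` queues; it is
an illustration of the architecture, not SPDL's actual implementation.

.. code-block:: python

    import asyncio

    async def stage_worker(fn, upstream: asyncio.Queue, downstream: asyncio.Queue) -> None:
        # Each worker repeatedly gets an item from the upstream queue,
        # applies the (async) stage function, and puts the result downstream.
        while True:
            item = await upstream.get()
            result = await fn(item)
            await downstream.put(result)

    def spawn_stage(fn, upstream, downstream, concurrency):
        # A stage with concurrency=N is simply N workers sharing the same queues.
        # (Must be called from within a running event loop.)
        return [
            asyncio.create_task(stage_worker(fn, upstream, downstream))
            for _ in range(concurrency)
        ]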
.. note::

    - In the following plots, a legend entry bearing a stage name refers to
      the performance of the queue attached at the exit of that stage.
    - The ``sink`` queue is the last queue of the pipeline, and it is consumed by
      the foreground (the training loop).
Throughput
~~~~~~~~~~
Each queue has a fixed-size buffer, so the buffer also serves as
a regulator of back pressure.
Because a stage cannot run far ahead of its downstream once its buffer is full,
the throughput is roughly the same across stages.
The :py:attr:`QueuePerfStats.qps` represents the number of items that
went through the queue (fetched by the downstream stage) per second.
Note that due to the ``aggregate`` and ``disaggregate`` operations,
the QPS values of different stages may lie in different ranges,
but they should match once normalized by the aggregation factor.
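For example, with the pipeline above, one item downstream of ``aggregate(8)``
corresponds to 8 source items, and one item downstream of ``aggregate(4)``
corresponds to 4 preprocessed samples.
A rough sketch of the normalization (the QPS numbers are illustrative, not from
the actual training run):

.. code-block:: python

    # QPS reported for each stage's output queue (illustrative values).
    qps_download = 12.5     # items are aggregates of 8 source items
    qps_preprocess = 100.0  # items are individual samples (after disaggregate)
    qps_collate = 25.0      # items are batches of 4 samples

    # Normalize everything to "source items per second".
    normalized = {
        "download": qps_download * 8,
        "preprocess": qps_preprocess,
        "collate": qps_collate * 4,
    }
    # In a healthy pipeline these normalized values are roughly equal (here, 100).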
Data Readiness
~~~~~~~~~~~~~~
The goal of performance analysis is to make it easy to find the
performance bottleneck. Due to the concurrent and elastic nature of
the pipeline, this task is not as straightforward as we would like it to be.
"Data readiness" is our latest attempt at finding the bottleneck.
It measures the relative time that a queue has the next data available.
We define data readiness as follows:
.. code-block:: text

    data_readiness := 1 - d_empty / d_measure
where ``d_measure`` is the duration of the measurement window (or elapsed time),
and ``d_empty`` is the duration for which the queue was empty during that window.
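As a concrete (illustrative) calculation: if the queue was empty for a total of
30 seconds within a 600-second measurement window, the data readiness is
``1 - 30 / 600 = 0.95``, i.e. 95%.

.. code-block:: python

    def data_readiness(d_empty: float, d_measure: float) -> float:
        """Fraction of the measurement window during which the queue had data."""
        return 1.0 - d_empty / d_measure

    readiness = data_readiness(d_empty=30.0, d_measure=600.0)  # 0.95, i.e. 95%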
A data readiness close to 100% intuitively means that whenever
the downstream stage (including the foreground training loop) tries to
fetch the next data, the data is available, meaning that the data processing
pipeline is faster than the demand.
On the other hand, a data readiness close to 0% suggests that the downstream
stages are starving. They fetch the next data as soon as it is available,
but the upstream cannot refill the queue quickly enough, indicating that the
data processing pipeline is not fast enough, and the system is
experiencing data starvation.
The following plot shows the data readiness of the same training run we saw above.
From the source up to the collate stage, the data readiness is close to or above 99%.
But at GPU transfer, it drops to 85%, and the sink readiness follows.
Note that the value of data readiness is meaningful only in the context of
stages before and after the queue. It is not valid to compare the readiness of
different queues.
The drop at preprocess means that the downstream stage (aggregate) is consuming
the results faster than the rate of production.
The recovery of the data readiness in collate means that the rate at which
the downstream stage (gpu_transfer) is consuming the result is not as fast
as the rate of production.
The data readiness of the sink recovers slightly.
This suggests that the rate at which the foreground training loop consumes the
data is slightly slower than the rate of transfer.
Queue Get/Put Time
~~~~~~~~~~~~~~~~~~
Other stats related to data readiness are the times the stage functions spend
waiting for queue get and put operations to complete.
If data is always available (i.e. data readiness is close to 100%), then stages
are not blocked on fetching the next data to process,
so the time spent waiting to get the next data is small.
On the other hand, if the upstream is not producing data fast enough, then the
downstream stages are blocked on fetching the next data.
The time to put shows the opposite trend.
If the data readiness is close to 100%, the queues are full, and stage functions
cannot put their next result until a spot becomes available, so the put wait
time gets longer. The wait time grows further when higher concurrency is applied,
as multiple concurrent invocations try to put results into the same queue.
If the rate of production is slow, the downstream stages are waiting for the next
items, so items put into the queue are fetched immediately and the queue stays
mostly empty. This leads to put times close to zero.
The following plot shows that the download and collate stages wait longer on put,
because they produce results faster than their downstream stages
(preprocess and gpu_transfer) consume them.
Summary
-------
SPDL Pipeline exposes performance statistics that help identify the
performance bottleneck.
The most important metric is the queue get time of the sink stage, because this
wait time directly translates to the time the foreground training loop
spends waiting for data.
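A simple way to cross-check this from the training side is to measure the fraction
of wall-clock time the foreground loop spends waiting for the next batch versus the
time it spends in the training step itself.
The sketch below assumes the consumption pattern shown earlier; ``training_step``
is a placeholder.

.. code-block:: python

    import time

    wait_time = 0.0
    step_time = 0.0
    with pipeline.auto_stop():
        it = iter(pipeline)
        while True:
            t0 = time.monotonic()
            try:
                batch = next(it)  # blocks on the sink queue (data wait)
            except StopIteration:
                break
            t1 = time.monotonic()
            training_step(batch)  # placeholder for the actual model update
            t2 = time.monotonic()
            wait_time += t1 - t0
            step_time += t2 - t1

    total = wait_time + step_time
    if total > 0:
        print(f"Fraction of time waiting for data: {wait_time / total:.1%}")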
.. include:: ../plots/perf_analysis.txt