Benchmark tarfile

Benchmark script for the spdl.io.iter_tarfile() function.

This script benchmarks the performance of iter_tarfile() against Python's built-in tarfile module using multi-threading. Two types of inputs are tested for iter_tarfile(): a byte string, and a file-like object that returns byte strings in chunks.

The benchmark:

  1. Creates test tar archives with various numbers of files

  2. Runs both implementations with different thread counts

  3. Measures queries per second (QPS) for each configuration

  4. Plots the results comparing the three implementations
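Step 3 above can be sketched with the standard library alone. The following is a simplified illustration of the QPS calculation, not the benchmark's actual BenchmarkRunner; measure_qps and process_one are hypothetical names introduced here for illustration.

```python
import time


def measure_qps(process_one, num_iterations: int) -> float:
    """Call ``process_one`` repeatedly and report files processed per second."""
    total_files = 0
    start = time.perf_counter()
    for _ in range(num_iterations):
        # Each call is assumed to return the number of files it processed.
        total_files += process_one()
    elapsed = time.perf_counter() - start
    return total_files / elapsed


# Hypothetical workload: pretend each call processes 100 files.
qps = measure_qps(lambda: 100, num_iterations=10)
```

The real benchmark additionally performs warmup iterations and repeats each measurement several times to compute confidence intervals.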

Example

$ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output results.csv
# Plot results
$ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot.png
# Plot results without load_wav
$ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot_2.png \
  --filter '4. SPDL iter_tarfile (bytes w/o convert)'

Result

The following plots show the QPS (measured in files processed per second) of each function at different file sizes.

Plot: ../_static/data/example_benchmark_tarfile.png
Plot: ../_static/data/example_benchmark_tarfile_2.png

The spdl.io.iter_tarfile() function processes data fastest when the input is a byte string. Its performance is consistent across different file sizes. This is because, when the entire TAR file is loaded into memory as a contiguous array, the function only needs to read the header and return the address of the corresponding data (note that iter_tarfile() returns a memory view when the input is a byte string). Since reading the header is very fast, most of the time is spent creating memory view objects while holding the GIL (Global Interpreter Lock). As a result, the speed of loading files decreases as more threads are used.
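The zero-copy behavior described above can be illustrated with plain stdlib types. This sketch is not the actual iter_tarfile() internals; it only shows that slicing a memoryview of an in-memory buffer records an offset into the buffer rather than copying the underlying data (the 512-byte header offset and payload are made up for illustration).

```python
# A contiguous in-memory buffer standing in for a TAR archive loaded as bytes.
# 512 is the TAR header block size; the payload here is made up.
archive = b"\x00" * 512 + b"file contents here" + b"\x00" * 100

# Slicing a memoryview creates a view object that references the original
# buffer; no bytes are copied at this point.
view = memoryview(archive)[512:512 + 18]

assert bytes(view) == b"file contents here"  # copying happens only on demand
assert view.obj is archive                   # the view still points at the buffer
```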

When the input data type is switched from a byte string to a file-like object, the performance of spdl.io.iter_tarfile() is also affected by the size of the input data. This is because data is processed incrementally, and for each file in the TAR archive, a new byte string object is created. The implementation tries to request the exact number of bytes needed, but file-like objects do not guarantee that they return the requested length; instead, they return at most the requested number of bytes. Therefore, many intermediate byte string objects must be created. As the file size grows, it takes longer to process the data. Since the GIL must be held while byte strings are created, performance degrades as more threads are used. At some point, the performance becomes similar to Python's built-in tarfile module, which is a pure-Python implementation and thus holds the GIL almost entirely.
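The short-read behavior described above is part of the Python file-object contract: read(n) may return fewer than n bytes. The following stdlib-only sketch shows why a caller must accumulate intermediate byte strings; ChunkedReader and read_exact are hypothetical names for illustration, not part of spdl.

```python
import io


class ChunkedReader:
    """A file-like object that returns at most ``chunk`` bytes per ``read()``."""

    def __init__(self, data: bytes, chunk: int = 4) -> None:
        self._buf = io.BytesIO(data)
        self._chunk = chunk

    def read(self, size: int = -1) -> bytes:
        if size < 0:
            return self._buf.read()
        return self._buf.read(min(size, self._chunk))


def read_exact(f, n: int) -> bytes:
    """Keep calling ``read`` until ``n`` bytes are gathered (or EOF)."""
    parts = []
    remaining = n
    while remaining > 0:
        piece = f.read(remaining)
        if not piece:
            break  # EOF before the requested length was reached
        parts.append(piece)  # one intermediate bytes object per short read
        remaining -= len(piece)
    return b"".join(parts)


# A 10-byte request is served in 4 + 4 + 2 byte pieces.
assert read_exact(ChunkedReader(b"0123456789abcdef"), 10) == b"0123456789"
```

Each iteration of the loop creates a new bytes object while holding the GIL, which is the cost the file-like code path pays.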

Source

#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# pyre-strict

"""Benchmark script for the :py:func:`spdl.io.iter_tarfile` function.

This script benchmarks the performance of :py:func:`~spdl.io.iter_tarfile` against
Python's built-in :py:mod:`tarfile` module using multi-threading.
Two types of inputs are tested for :py:func:`~spdl.io.iter_tarfile`:
a byte string, and a file-like object that returns byte strings in chunks.

The benchmark:

1. Creates test tar archives with various numbers of files
2. Runs both implementations with different thread counts
3. Measures queries per second (QPS) for each configuration
4. Plots the results comparing the three implementations

**Example**

.. code-block:: shell

   $ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output results.csv
   # Plot results
   $ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot.png
   # Plot results without load_wav
   $ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot_2.png \\
     --filter '4. SPDL iter_tarfile (bytes w/o convert)'

**Result**

The following plots show the QPS (measured in files processed per second) of each
function at different file sizes.

.. image:: ../../_static/data/example_benchmark_tarfile.png

.. image:: ../../_static/data/example_benchmark_tarfile_2.png

The :py:func:`spdl.io.iter_tarfile` function processes data fastest when the input is a byte
string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that :py:func:`~spdl.io.iter_tarfile` returns a memory view when the input is a byte
string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.

When the input data type is switched from a byte string to a file-like object,
the performance of :py:func:`spdl.io.iter_tarfile` is also affected by the size of
the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be held while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to Python's built-in ``tarfile`` module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
"""

__all__ = [
    "BenchmarkConfig",
    "create_test_tar",
    "iter_tarfile_builtin",
    "main",
    "process_tar_builtin",
    "process_tar_spdl",
    "process_tar_spdl_filelike",
]

import argparse
import io
import os
import tarfile
from collections.abc import Callable, Iterator
from dataclasses import dataclass
from functools import partial

import spdl.io

try:
    from examples.benchmark_utils import (  # pyre-ignore[21]
        BenchmarkResult,
        BenchmarkRunner,
        ExecutorType,
        get_default_result_path,
        save_results_to_csv,
    )
except ImportError:
    from spdl.examples.benchmark_utils import (
        BenchmarkResult,
        BenchmarkRunner,
        ExecutorType,
        get_default_result_path,
        save_results_to_csv,
    )


DEFAULT_RESULT_PATH: str = get_default_result_path(__file__)


@dataclass
class BenchmarkConfig:
    """BenchmarkConfig()

    Configuration for a single TAR benchmark run."""

    function_name: str
    """Name of the function being tested"""

    tar_size: int
    """Total size of the TAR archive in bytes"""

    file_size: int
    """Size of each file in the TAR archive in bytes"""

    num_files: int
    """Number of files in the TAR archive"""

    num_threads: int
    """Number of concurrent threads"""

    num_iterations: int
    """Number of iterations per run"""

    total_files_processed: int
    """Total number of files processed across all iterations"""


def iter_tarfile_builtin(tar_data: bytes) -> Iterator[tuple[str, bytes]]:
    """Iterate over TAR file using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Yields:
        Tuple of ``(filename, content)`` for each file in the archive.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_data), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    content = file_obj.read()
                    yield member.name, content


def process_tar_spdl(tar_data: bytes, convert: bool) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile`.

    Args:
        tar_data: TAR archive as bytes.
        convert: If ``True``, convert each yielded memory view to ``bytes``.

    Returns:
        Number of files processed.
    """
    count = 0
    if convert:
        for _, content in spdl.io.iter_tarfile(tar_data):
            bytes(content)
            count += 1
        return count
    else:
        for _ in spdl.io.iter_tarfile(tar_data):
            count += 1
        return count


def process_tar_builtin(tar_data: bytes) -> int:
    """Process TAR archive using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    for _ in iter_tarfile_builtin(tar_data):
        count += 1
    return count


def process_tar_spdl_filelike(tar_data: bytes) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile` with file-like object.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    file_like = io.BytesIO(tar_data)
    for _ in spdl.io.iter_tarfile(file_like):  # pyre-ignore[6]
        count += 1
    return count


def _size_str(n: int) -> str:
    if n < 1024:
        return f"{n} B"
    if n < 1024 * 1024:
        return f"{n / 1024: .2f} kB"
    if n < 1024 * 1024 * 1024:
        return f"{n / (1024 * 1024): .2f} MB"
    return f"{n / (1024 * 1024 * 1024): .2f} GB"


def create_test_tar(num_files: int, file_size: int) -> bytes:
    """Create a TAR archive in memory with specified number of files.

    Args:
        num_files: Number of files to include in the archive.
        file_size: Size of each file in bytes.

    Returns:
        TAR archive as bytes.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for i in range(num_files):
            filename = f"file_{i:06d}.txt"
            content = b"1" * file_size
            info = tarfile.TarInfo(name=filename)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    tar_buffer.seek(0)
    return tar_buffer.getvalue()


def _parse_args() -> argparse.Namespace:
    """Parse command line arguments.

    Returns:
        Parsed arguments.
    """
    parser = argparse.ArgumentParser(
        description="Benchmark iter_tarfile performance with multi-threading"
    )
    parser.add_argument(
        "--num-files",
        type=int,
        default=100,
        help="Number of files in the test TAR archive",
    )
    parser.add_argument(
        "--num-iterations",
        type=int,
        default=100,
        help="Number of iterations for each thread count",
    )
    parser.add_argument(
        "--output",
        type=lambda p: os.path.realpath(p),
        default=DEFAULT_RESULT_PATH,
        help="Output path for the results",
    )

    return parser.parse_args()


def main() -> None:
    """Main entry point for the benchmark script.

    Parses command-line arguments, runs benchmarks, and saves the results to a CSV file.
    """

    args = _parse_args()

    # Define explicit configuration lists
    thread_counts = [1, 4, 8, 16, 32]
    file_sizes = [2**8, 2**12, 2**16, 2**20]

    # Define benchmark function configurations
    # (function_name, function)
    benchmark_functions: list[tuple[str, Callable[[bytes], int]]] = [
        ("1. Python tarfile", process_tar_builtin),
        ("2. SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
        (
            "3. SPDL iter_tarfile (bytes w/ convert)",
            partial(process_tar_spdl, convert=True),
        ),
        (
            "4. SPDL iter_tarfile (bytes w/o convert)",
            partial(process_tar_spdl, convert=False),
        ),
    ]

    print("Starting benchmark with configuration:")
    print(f"  Number of files: {args.num_files}")
    print(f"  File sizes: {file_sizes} bytes")
    print(f"  Iterations per thread count: {args.num_iterations}")
    print(f"  Thread counts: {thread_counts}")

    results: list[BenchmarkResult[BenchmarkConfig]] = []
    num_runs = 5

    for num_threads in thread_counts:
        with BenchmarkRunner(
            executor_type=ExecutorType.THREAD,
            num_workers=num_threads,
            warmup_iterations=10 * num_threads,
        ) as runner:
            for file_size in file_sizes:
                tar_data = create_test_tar(args.num_files, file_size)
                for func_name, func in benchmark_functions:
                    print(
                        f"TAR size: {_size_str(len(tar_data))} "
                        f"({args.num_files} x {_size_str(file_size)}), "
                        f"'{func_name}', {num_threads} threads"
                    )

                    total_files_processed = args.num_files * args.num_iterations

                    config = BenchmarkConfig(
                        function_name=func_name,
                        tar_size=len(tar_data),
                        file_size=file_size,
                        num_files=args.num_files,
                        num_threads=num_threads,
                        num_iterations=args.num_iterations,
                        total_files_processed=total_files_processed,
                    )

                    result, _ = runner.run(
                        config,
                        partial(func, tar_data),
                        args.num_iterations,
                        num_runs=num_runs,
                    )

                    margin = (result.ci_upper - result.ci_lower) / 2
                    print(
                        f"  QPS: {result.qps:8.2f} ± {margin:.2f}  "
                        f"({result.ci_lower:.2f}-{result.ci_upper:.2f}, "
                        f"{num_runs} runs, {total_files_processed} files)"
                    )

                    results.append(result)

    # Save results to CSV
    save_results_to_csv(results, args.output)

    print(
        f"Benchmark complete. To generate plots, run: "
        f"python benchmark_tarfile_plot.py --input {args.output} "
        f"--output {args.output.replace('.csv', '.png')}"
    )


if __name__ == "__main__":
    main()

API Reference

Functions

create_test_tar(num_files: int, file_size: int) → bytes

Create a TAR archive in memory with specified number of files.

Parameters:
  • num_files – Number of files to include in the archive.

  • file_size – Size of each file in bytes.

Returns:

TAR archive as bytes.

iter_tarfile_builtin(tar_data: bytes) → Iterator[tuple[str, bytes]]

Iterate over TAR file using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Yields:

Tuple of (filename, content) for each file in the archive.
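A self-contained usage sketch of this iteration pattern, building a tiny one-file archive in memory with the stdlib (the file name and contents here are made up for illustration):

```python
import io
import tarfile

# Build a one-file archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Iterate it the same way iter_tarfile_builtin does.
files = []
with tarfile.open(fileobj=io.BytesIO(buf.getvalue()), mode="r") as tar:
    for member in tar.getmembers():
        if member.isfile():
            file_obj = tar.extractfile(member)
            if file_obj:
                files.append((member.name, file_obj.read()))
```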

main() → None

Main entry point for the benchmark script.

Parses command-line arguments, runs benchmarks, and saves the results to a CSV file.

process_tar_builtin(tar_data: bytes) → int

Process TAR archive using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

process_tar_spdl(tar_data: bytes, convert: bool) → int

Process TAR archive using spdl.io.iter_tarfile().

Parameters:
  • tar_data – TAR archive as bytes.

  • convert – If True, convert each yielded memory view to bytes.

Returns:

Number of files processed.

process_tar_spdl_filelike(tar_data: bytes) → int

Process TAR archive using spdl.io.iter_tarfile() with file-like object.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

Classes

class BenchmarkConfig

Configuration for a single TAR benchmark run.

file_size: int

Size of each file in the TAR archive in bytes

function_name: str

Name of the function being tested

num_files: int

Number of files in the TAR archive

num_iterations: int

Number of iterations per run

num_threads: int

Number of concurrent threads

tar_size: int

Total size of the TAR archive in bytes

total_files_processed: int

Total number of files processed across all iterations