Benchmark tarfile

Benchmark script for the spdl.io.iter_tarfile() function.

This script benchmarks the performance of iter_tarfile() against Python’s built-in tarfile module using multi-threading. Two types of inputs are tested for iter_tarfile(): a byte string, and a file-like object that returns byte strings in chunks.

The benchmark:

  1. Creates test tar archives with various numbers of files

  2. Runs both implementations with different thread counts

  3. Measures queries per second (QPS) for each configuration

  4. Plots the results comparing the three implementations

Example

$ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output results.csv
# Plot results
$ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot.png
# Plot results without load_wav
$ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot_2.png \
  --filter '4. SPDL iter_tarfile (bytes w/o convert)'

Result

The following plots show the QPS (measured by the number of files processed) of each function across different file sizes.

[Plot: ../_static/data/example_benchmark_tarfile.png]

[Plot: ../_static/data/example_benchmark_tarfile_2.png]

The spdl.io.iter_tarfile() function processes data fastest when the input is a byte string. Its performance is consistent across different file sizes. This is because, when the entire TAR file is loaded into memory as a contiguous array, the function only needs to read the header and return the address of the corresponding data (note that iter_tarfile() returns a memory view when the input is a byte string). Since reading the header is very fast, most of the time is spent creating memory view objects while holding the GIL (Global Interpreter Lock). As a result, the speed of loading files decreases as more threads are used.
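
The header-then-memory-view approach described above can be sketched in pure Python. The following simplified ustar reader is only an illustration of the idea, not SPDL's actual (native) implementation: it parses each 512-byte header for the name and size, then yields a zero-copy memoryview slice into the original buffer instead of copying the payload.

```python
import io
import tarfile


def iter_tar_bytes(data: bytes):
    """Yield (name, memoryview) for each regular file in an in-memory TAR.

    Simplified ustar reader: name lives at bytes 0-100 of the header,
    the size is an octal string at bytes 124-136, and payloads are
    padded to 512-byte blocks.
    """
    mv = memoryview(data)
    offset = 0
    while offset + 512 <= len(data):
        header = data[offset : offset + 512]
        if header == b"\0" * 512:  # all-zero block marks end of archive
            break
        name = header[0:100].split(b"\0", 1)[0].decode()
        size = int(header[124:136].split(b"\0", 1)[0] or b"0", 8)
        typeflag = header[156:157]
        payload_start = offset + 512
        if typeflag in (b"0", b"\0"):  # regular file entries only
            yield name, mv[payload_start : payload_start + size]
        # skip payload, rounded up to the next 512-byte block boundary
        offset = payload_start + ((size + 511) // 512) * 512


# Build a small archive in memory and iterate it without copying payloads.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo(name="hello.txt")
    content = b"hello tar"
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))

for name, view in iter_tar_bytes(buf.getvalue()):
    print(name, bytes(view))  # prints: hello.txt b'hello tar'
```

Because each yielded memoryview references the original buffer, the only per-file work under the GIL is header parsing and creating the small view object, which is why the byte-string path is largely insensitive to file size.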

When the input data type is switched from a byte string to a file-like object, the performance of spdl.io.iter_tarfile() is also affected by the size of the input data. This is because data is processed incrementally, and for each file in the TAR archive, a new byte string object is created. The implementation tries to request the exact number of bytes needed, but file-like objects do not guarantee that they return the requested length; instead, they return at most the requested number of bytes. Therefore, many intermediate byte string objects must be created. As the file size grows, it takes longer to process the data. Since the GIL must be held while byte strings are created, performance degrades as more threads are used. At some point, the performance becomes similar to Python’s built-in tarfile module, which is a pure-Python implementation and thus holds the GIL almost entirely.
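
The at-most-n behavior of read() and the resulting intermediate allocations can be demonstrated with a small stdlib-only sketch. ChunkedReader and read_exact below are hypothetical helpers for illustration, not part of spdl.io: the wrapper returns at most a few bytes per call (like a network stream), so reading one exact-sized block forces many small byte-string allocations.

```python
import io


class ChunkedReader:
    """File-like wrapper that returns at most ``chunk`` bytes per read(),
    mimicking a stream that ignores the requested length."""

    def __init__(self, data: bytes, chunk: int = 7) -> None:
        self._inner = io.BytesIO(data)
        self._chunk = chunk
        self.reads = 0  # count how many read() calls were needed

    def read(self, n: int = -1) -> bytes:
        self.reads += 1
        if n < 0 or n > self._chunk:
            n = self._chunk
        return self._inner.read(n)


def read_exact(f, n: int) -> bytes:
    """Read exactly n bytes, accumulating intermediate byte strings."""
    parts = []
    remaining = n
    while remaining > 0:
        piece = f.read(remaining)
        if not piece:  # EOF before n bytes were available
            break
        parts.append(piece)
        remaining -= len(piece)
    return b"".join(parts)


f = ChunkedReader(b"x" * 512)
block = read_exact(f, 512)
print(f"{len(block)} bytes in {f.reads} reads")  # → 512 bytes in 74 reads
```

Every one of those intermediate byte strings is allocated while the GIL is held, which is why the file-like path scales worse with thread count and file size than the byte-string path.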

Source

#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# pyre-strict

"""Benchmark script for the :py:func:`spdl.io.iter_tarfile` function.

This script benchmarks the performance of :py:func:`~spdl.io.iter_tarfile` against
Python's built-in :py:mod:`tarfile` module using multi-threading.
Two types of inputs are tested for :py:func:`~spdl.io.iter_tarfile`:
a byte string, and a file-like object that returns byte strings in chunks.

The benchmark:

1. Creates test tar archives with various numbers of files
2. Runs both implementations with different thread counts
3. Measures queries per second (QPS) for each configuration
4. Plots the results comparing the three implementations

**Example**

.. code-block:: shell

   $ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output results.csv
   # Plot results
   $ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot.png
   # Plot results without load_wav
   $ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot_2.png \\
     --filter '4. SPDL iter_tarfile (bytes w/o convert)'

**Result**

The following plots show the QPS (measured by the number of files processed) of each
function across different file sizes.

.. image:: ../../_static/data/example_benchmark_tarfile.png

.. image:: ../../_static/data/example_benchmark_tarfile_2.png

The :py:func:`spdl.io.iter_tarfile` function processes data fastest when the input is a byte
string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that :py:func:`~spdl.io.iter_tarfile` returns a memory view when the input is a byte
string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.

When the input data type is switched from a byte string to a file-like object,
the performance of :py:func:`spdl.io.iter_tarfile` is also affected by the size of
the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be held while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to Python's built-in ``tarfile`` module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
"""

__all__ = [
    "BenchmarkConfig",
    "BenchmarkResult",
    "create_test_tar",
    "iter_tarfile_builtin",
    "main",
    "process_tar_builtin",
    "process_tar_spdl",
    "process_tar_spdl_filelike",
]

import argparse
import io
import os
import tarfile
from collections.abc import Callable, Iterator
from dataclasses import dataclass
from functools import partial

import spdl.io

try:
    from examples.benchmark_utils import (  # pyre-ignore[21]
        BenchmarkResult,
        BenchmarkRunner,
        ExecutorType,
        get_default_result_path,
        save_results_to_csv,
    )
except ImportError:
    from spdl.examples.benchmark_utils import (
        BenchmarkResult,
        BenchmarkRunner,
        ExecutorType,
        get_default_result_path,
        save_results_to_csv,
    )


DEFAULT_RESULT_PATH: str = get_default_result_path(__file__)


@dataclass
class BenchmarkConfig:
    """Configuration for a single TAR benchmark run."""

    function_name: str
    tar_size: int
    file_size: int
    num_files: int
    num_threads: int
    num_iterations: int
    total_files_processed: int


def iter_tarfile_builtin(tar_data: bytes) -> Iterator[tuple[str, bytes]]:
    """Iterate over TAR file using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Yields:
        Tuple of ``(filename, content)`` for each file in the archive.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_data), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    content = file_obj.read()
                    yield member.name, content


def process_tar_spdl(tar_data: bytes, convert: bool) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile`.

    Args:
        tar_data: TAR archive as bytes.
        convert: If ``True``, convert each yielded memory view to ``bytes``.

    Returns:
        Number of files processed.
    """
    count = 0
    if convert:
        for _, content in spdl.io.iter_tarfile(tar_data):
            bytes(content)
            count += 1
        return count
    else:
        for _ in spdl.io.iter_tarfile(tar_data):
            count += 1
        return count


def process_tar_builtin(tar_data: bytes) -> int:
    """Process TAR archive using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    for _ in iter_tarfile_builtin(tar_data):
        count += 1
    return count


def process_tar_spdl_filelike(tar_data: bytes) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile` with file-like object.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    file_like = io.BytesIO(tar_data)
    for _ in spdl.io.iter_tarfile(file_like):  # pyre-ignore[6]
        count += 1
    return count


def _size_str(n: int) -> str:
    if n < 1024:
        return f"{n} B"
    if n < 1024 * 1024:
        return f"{n / 1024: .2f} kB"
    if n < 1024 * 1024 * 1024:
        return f"{n / (1024 * 1024): .2f} MB"
    return f"{n / (1024 * 1024 * 1024): .2f} GB"


def create_test_tar(num_files: int, file_size: int) -> bytes:
    """Create a TAR archive in memory with specified number of files.

    Args:
        num_files: Number of files to include in the archive.
        file_size: Size of each file in bytes.

    Returns:
        TAR archive as bytes.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for i in range(num_files):
            filename = f"file_{i:06d}.txt"
            content = b"1" * file_size
            info = tarfile.TarInfo(name=filename)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    tar_buffer.seek(0)
    return tar_buffer.getvalue()


def _parse_args() -> argparse.Namespace:
    """Parse command line arguments.

    Returns:
        Parsed arguments.
    """
    parser = argparse.ArgumentParser(
        description="Benchmark iter_tarfile performance with multi-threading"
    )
    parser.add_argument(
        "--num-files",
        type=int,
        default=100,
        help="Number of files in the test TAR archive",
    )
    parser.add_argument(
        "--num-iterations",
        type=int,
        default=100,
        help="Number of iterations for each thread count",
    )
    parser.add_argument(
        "--output",
        type=lambda p: os.path.realpath(p),
        default=DEFAULT_RESULT_PATH,
        help="Output path for the results",
    )

    return parser.parse_args()


def main() -> None:
    """Main entry point for the benchmark script.

    Parses command-line arguments, runs benchmarks, and saves the results to CSV.
    """

    args = _parse_args()

    # Define explicit configuration lists
    thread_counts = [1, 4, 8, 16, 32]
    file_sizes = [2**8, 2**12, 2**16, 2**20]

    # Define benchmark function configurations
    # (function_name, function)
    benchmark_functions: list[tuple[str, Callable[[bytes], int]]] = [
        ("1. Python tarfile", process_tar_builtin),
        ("2. SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
        (
            "3. SPDL iter_tarfile (bytes w/ convert)",
            partial(process_tar_spdl, convert=True),
        ),
        (
            "4. SPDL iter_tarfile (bytes w/o convert)",
            partial(process_tar_spdl, convert=False),
        ),
    ]

    print("Starting benchmark with configuration:")
    print(f"  Number of files: {args.num_files}")
    print(f"  File sizes: {file_sizes} bytes")
    print(f"  Iterations per thread count: {args.num_iterations}")
    print(f"  Thread counts: {thread_counts}")

    results: list[BenchmarkResult[BenchmarkConfig]] = []
    num_runs = 5

    for num_threads in thread_counts:
        with BenchmarkRunner(
            executor_type=ExecutorType.THREAD,
            num_workers=num_threads,
            warmup_iterations=10 * num_threads,
        ) as runner:
            for file_size in file_sizes:
                tar_data = create_test_tar(args.num_files, file_size)
                for func_name, func in benchmark_functions:
                    print(
                        f"TAR size: {_size_str(len(tar_data))} "
                        f"({args.num_files} x {_size_str(file_size)}), "
                        f"'{func_name}', {num_threads} threads"
                    )

                    total_files_processed = args.num_files * args.num_iterations

                    config = BenchmarkConfig(
                        function_name=func_name,
                        tar_size=len(tar_data),
                        file_size=file_size,
                        num_files=args.num_files,
                        num_threads=num_threads,
                        num_iterations=args.num_iterations,
                        total_files_processed=total_files_processed,
                    )

                    result, _ = runner.run(
                        config,
                        partial(func, tar_data),
                        args.num_iterations,
                        num_runs=num_runs,
                    )

                    margin = (result.ci_upper - result.ci_lower) / 2
                    print(
                        f"  QPS: {result.qps:8.2f} ± {margin:.2f}  "
                        f"({result.ci_lower:.2f}-{result.ci_upper:.2f}, "
                        f"{num_runs} runs, {total_files_processed} files)"
                    )

                    results.append(result)

    # Save results to CSV
    save_results_to_csv(results, args.output)

    print(
        f"Benchmark complete. To generate plots, run: "
        f"python benchmark_tarfile_plot.py --input {args.output} "
        f"--output {args.output.replace('.csv', '.png')}"
    )


if __name__ == "__main__":
    main()

Functions

create_test_tar(num_files: int, file_size: int) → bytes

Create a TAR archive in memory with specified number of files.

Parameters:
  • num_files – Number of files to include in the archive.

  • file_size – Size of each file in bytes.

Returns:

TAR archive as bytes.

iter_tarfile_builtin(tar_data: bytes) → Iterator[tuple[str, bytes]]

Iterate over TAR file using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Yields:

Tuple of (filename, content) for each file in the archive.

main() → None

Main entry point for the benchmark script.

Parses command-line arguments, runs benchmarks, and saves the results to CSV.

process_tar_builtin(tar_data: bytes) → int

Process TAR archive using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

process_tar_spdl(tar_data: bytes, convert: bool) → int

Process TAR archive using spdl.io.iter_tarfile().

Parameters:
  • tar_data – TAR archive as bytes.

  • convert – If True, convert each yielded memory view to bytes.

Returns:

Number of files processed.

process_tar_spdl_filelike(tar_data: bytes) → int

Process TAR archive using spdl.io.iter_tarfile() with file-like object.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

Classes

class BenchmarkConfig(function_name: str, tar_size: int, file_size: int, num_files: int, num_threads: int, num_iterations: int, total_files_processed: int)

Configuration for a single TAR benchmark run.

file_size: int
function_name: str
num_files: int
num_iterations: int
num_threads: int
tar_size: int
total_files_processed: int
class BenchmarkResult

Generic benchmark result containing configuration and performance metrics.

This class holds both the benchmark-specific configuration and the common performance statistics. It is parameterized by the config type, which allows each benchmark script to define its own configuration dataclass.

ci_lower: float

Lower bound of 95% confidence interval for QPS

ci_upper: float

Upper bound of 95% confidence interval for QPS

config: ConfigT

Benchmark-specific configuration (e.g., data format, file size, etc.)

date: str

When benchmark was run. ISO 8601 format.

executor_type: str

Type of executor used (thread, process, or interpreter)

free_threaded: bool

Whether Python is running with free-threaded ABI.

python_version: str

Python version used for the benchmark

qps: float

Queries per second (mean)
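
The qps, ci_lower, and ci_upper fields can be related by a standard t-interval over repeated runs. The sketch below is an assumption for illustration: the formula and the t-value for 5 runs mirror common practice, not necessarily what benchmark_utils actually computes.

```python
import statistics


def qps_confidence_interval(
    qps_samples: list[float], t_value: float = 2.776
) -> tuple[float, float, float]:
    """Return (mean, ci_lower, ci_upper) for a 95% t-interval.

    t_value=2.776 is the two-sided 95% critical value for 4 degrees of
    freedom (i.e., 5 runs). Hypothetical helper, not part of benchmark_utils.
    """
    mean = statistics.fmean(qps_samples)
    # standard error of the mean: sample stdev / sqrt(n)
    sem = statistics.stdev(qps_samples) / len(qps_samples) ** 0.5
    return mean, mean - t_value * sem, mean + t_value * sem


qps, lo, hi = qps_confidence_interval([100.0, 104.0, 98.0, 102.0, 96.0])
margin = (hi - lo) / 2
print(f"QPS: {qps:.2f} ± {margin:.2f}")  # → QPS: 100.00 ± 3.93
```

This matches how the benchmark script reports results: the printed margin is (ci_upper - ci_lower) / 2 around the mean QPS.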