Benchmark tarfile

Benchmark script for the iter_tarfile function.

This script benchmarks the performance of spdl.io.iter_tarfile() against Python's built-in tarfile module using multi-threading. Two types of inputs are tested for spdl.io.iter_tarfile(): a byte string, and a file-like object that returns byte strings chunk by chunk.
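To make the two input modes concrete, here is a minimal sketch (the (filename, content) yield shape matches the source below; the archive path is hypothetical):

import io
import spdl.io

with open("archive.tar", "rb") as f:  # hypothetical archive path
    tar_data = f.read()

# Mode 1: pass the whole archive as a byte string.
for filename, content in spdl.io.iter_tarfile(tar_data):
    print(filename, len(content))

# Mode 2: pass a file-like object that returns byte strings chunk by chunk.
for filename, content in spdl.io.iter_tarfile(io.BytesIO(tar_data)):
    print(filename, len(content))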

The benchmark:

  1. Creates test tar archives with various numbers of files

  2. Runs both implementations with different thread counts

  3. Measures queries per second (QPS) for each configuration

  4. Plots the results comparing the implementations

Example

$ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output iter_tarfile_benchmark_results.csv
# Plot results
$ python plot_tar_benchmark.py --input iter_tarfile_benchmark_results.csv --output wav_benchmark_plot.png
# Plot results without the '4. SPDL iter_tarfile (bytes w/o convert)' series
$ python plot_tar_benchmark.py --input iter_tarfile_benchmark_results.csv --output wav_benchmark_plot_2.png --filter '4. SPDL iter_tarfile (bytes w/o convert)'

Result

The following plots show the QPS (measured by the number of files processed) of each function with different file sizes.

[Image: ../_static/data/example_benchmark_tarfile.png]

[Image: ../_static/data/example_benchmark_tarfile_2.png]

The spdl.io.iter_tarfile() function processes data fastest when the input is a byte string. Its performance is consistent across different file sizes. This is because, when the entire TAR file is loaded into memory as a contiguous array, the function only needs to read the header and return the address of the corresponding data (note that iter_tarfile() returns a memory view when the input is a byte string). Since reading the header is very fast, most of the time is spent creating memory view objects while holding the GIL (Global Interpreter Lock). As a result, the speed of loading files decreases as more threads are used.
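As a rough illustration of this zero-copy behavior (a sketch, assuming spdl is installed and using the create_test_tar helper defined in this script; the memoryview return type for bytes input is stated above):

tar_data = create_test_tar(num_files=10, file_size=4096)
for filename, content in spdl.io.iter_tarfile(tar_data):
    assert isinstance(content, memoryview)  # a view into tar_data, no copy
    payload = bytes(content)  # copying it out allocates while holding the GIL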

When the input data type is switched from a byte string to a file-like object, the performance of spdl.io.iter_tarfile() is also affected by the size of the input data. This is because data is processed incrementally, and for each file in the TAR archive, a new byte string object is created. The implementation tries to request the exact number of bytes needed, but file-like objects do not guarantee that they return the requested length; instead, they return at most the requested number of bytes. Therefore, many intermediate byte string objects must be created. As the file size grows, it takes longer to process the data. Since the GIL must be held while byte strings are created, performance degrades as more threads are used. At some point, the performance becomes similar to Python's built-in tarfile module, which is a pure-Python implementation and thus holds the GIL almost entirely.
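The read contract in question can be sketched with a hypothetical read_exact helper; every loop iteration below produces a new intermediate byte string, which is the overhead described above:

def read_exact(fileobj, n: int) -> bytes:
    # file-like .read(n) may return fewer than n bytes, so the caller
    # must loop, creating an intermediate bytes object per iteration.
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = fileobj.read(remaining)
        if not chunk:  # EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)  # one more allocation to join the pieces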

Source

#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# pyre-unsafe

"""Benchmark script for the iter_tarfile function.

This script benchmarks the performance of :py:func:`spdl.io.iter_tarfile` against
Python's built-in ``tarfile`` module using multi-threading.
Two types of inputs are tested for :py:func:`spdl.io.iter_tarfile`:
a byte string, and a file-like object that returns byte strings chunk by chunk.

The benchmark:

1. Creates test tar archives with various numbers of files
2. Runs both implementations with different thread counts
3. Measures queries per second (QPS) for each configuration
4. Plots the results comparing the implementations

**Example**

.. code-block:: shell

   $ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output iter_tarfile_benchmark_results.csv
   # Plot results
   $ python plot_tar_benchmark.py --input iter_tarfile_benchmark_results.csv --output wav_benchmark_plot.png
   # Plot results without the '4. SPDL iter_tarfile (bytes w/o convert)' series
   $ python plot_tar_benchmark.py --input iter_tarfile_benchmark_results.csv --output wav_benchmark_plot_2.png --filter '4. SPDL iter_tarfile (bytes w/o convert)'

**Result**

The following plots show the QPS (measured by the number of files processed) of each
function with different file sizes.

.. image:: ../../_static/data/example_benchmark_tarfile.png

.. image:: ../../_static/data/example_benchmark_tarfile_2.png

The :py:func:`spdl.io.iter_tarfile` function processes data fastest when the input is a byte
string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that :py:func:`~spdl.io.iter_tarfile` returns a memory view when the input is a byte
string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.

When the input data type is switched from a byte string to a file-like object,
the performance of :py:func:`spdl.io.iter_tarfile` is also affected by the size of
the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be held while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to Python's built-in ``tarfile`` module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
"""

__all__ = [
    "BenchmarkResult",
    "benchmark",
    "create_test_tar",
    "iter_tarfile_builtin",
    "main",
    "save_results_to_csv",
    "process_tar_builtin",
    "process_tar_spdl",
    "process_tar_spdl_filelike",
    "run_benchmark",
]

import argparse
import csv
import io
import logging
import sys
import tarfile
import time
from collections.abc import Callable, Iterator
from concurrent.futures import as_completed, ThreadPoolExecutor
from dataclasses import dataclass
from functools import partial

import numpy as np
import spdl.io

_LG = logging.getLogger(__name__)


def _get_python_info() -> tuple[str, bool]:
    """Get Python version and free-threaded ABI information.

    Returns:
        Tuple of (python_version, is_free_threaded)
    """
    python_version = (
        f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
    )
    # Check if Python is running with free-threaded ABI (PEP 703).
    # sys._is_gil_enabled() is only available in Python 3.13+ and returns True
    # when the GIL is active, so the runtime is free-threaded when it is False.
    try:
        is_free_threaded = not sys._is_gil_enabled()  # pyre-ignore[16]
    except AttributeError:
        is_free_threaded = False
    return python_version, is_free_threaded


def iter_tarfile_builtin(tar_data: bytes) -> Iterator[tuple[str, bytes]]:
    """Iterate over TAR file using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Yields:
        Tuple of ``(filename, content)`` for each file in the archive.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_data), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    content = file_obj.read()
                    yield member.name, content


def process_tar_spdl(tar_data: bytes, convert: bool) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile`.

    Args:
        tar_data: TAR archive as bytes.
        convert: If ``True``, convert each yielded memory view to ``bytes``.

    Returns:
        Number of files processed.
    """
    count = 0
    if convert:
        for _, content in spdl.io.iter_tarfile(tar_data):
            bytes(content)
            count += 1
        return count
    else:
        for _ in spdl.io.iter_tarfile(tar_data):
            count += 1
        return count


def process_tar_builtin(tar_data: bytes) -> int:
    """Process TAR archive using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    for _ in iter_tarfile_builtin(tar_data):
        count += 1
    return count


def process_tar_spdl_filelike(tar_data: bytes) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile` with file-like object.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    file_like = io.BytesIO(tar_data)
    for _ in spdl.io.iter_tarfile(file_like):  # pyre-ignore[6]
        count += 1
    return count


def benchmark(
    func,
    tar_data: bytes,
    num_iterations: int,
    num_threads: int,
    num_runs: int = 5,
) -> tuple[int, float, float, float]:
    """Benchmark function with specified number of threads.

    Runs multiple benchmark iterations and calculates 95% confidence intervals.

    Args:
        func: Function to benchmark (e.g., ``process_tar_spdl`` or ``process_tar_builtin``).
        tar_data: TAR archive as bytes.
        num_iterations: Number of iterations to run per benchmark run.
        num_threads: Number of threads to use.
        num_runs: Number of benchmark runs to perform for confidence interval calculation.
            Defaults to 5.

    Returns:
        Tuple of ``(total_files_processed, qps_mean, qps_lower_ci, qps_upper_ci)``.
    """
    qps_samples = []
    last_total_count = 0

    with ThreadPoolExecutor(max_workers=num_threads) as exe:
        # Warm-up phase: run a few iterations to warm up the executor
        warmup_futures = [exe.submit(func, tar_data) for _ in range(10 * num_threads)]
        for future in as_completed(warmup_futures):
            _ = future.result()

        # Run multiple benchmark iterations
        for _ in range(num_runs):
            t0 = time.monotonic()
            futures = [exe.submit(func, tar_data) for _ in range(num_iterations)]
            total_count = 0
            for future in as_completed(futures):
                total_count += future.result()
            elapsed = time.monotonic() - t0

            qps = num_iterations / elapsed
            qps_samples.append(qps)
            last_total_count = total_count

    # Calculate mean and 95% confidence interval
    qps_mean = sum(qps_samples) / len(qps_samples)
    qps_std = np.std(qps_samples, ddof=1)
    # Using t-distribution critical value for 95% CI
    # For small samples (n=5), t-value ≈ 2.776
    t_value = 2.776 if num_runs == 5 else 2.0
    margin = t_value * qps_std / (num_runs**0.5)
    qps_lower_ci = qps_mean - margin
    qps_upper_ci = qps_mean + margin

    return last_total_count, qps_mean, qps_lower_ci, qps_upper_ci


def _size_str(n: int) -> str:
    if n < 1024:
        return f"{n} B"
    if n < 1024 * 1024:
        return f"{n / 1024: .2f} kB"
    if n < 1024 * 1024 * 1024:
        return f"{n / (1024 * 1024): .2f} MB"
    return f"{n / (1024 * 1024 * 1024): .2f} GB"


@dataclass
class BenchmarkResult:
    """Single benchmark result for a specific configuration."""

    function_name: str
    "Name of the function being benchmarked."
    tar_size: int
    "Size of the TAR archive in bytes."
    file_size: int
    "Size of each file in the TAR archive in bytes."
    num_files: int
    "Number of files in the TAR archive."
    num_threads: int
    "Number of threads used for this benchmark."
    num_iterations: int
    "Number of iterations performed."
    qps_mean: float
    "Mean queries per second (QPS)."
    qps_lower_ci: float
    "Lower bound of 95% confidence interval for QPS."
    qps_upper_ci: float
    "Upper bound of 95% confidence interval for QPS."
    total_files_processed: int
    "Total number of files processed during the benchmark."
    python_version: str
    "Python version used for the benchmark."
    free_threaded: bool
    "Whether Python is running with free-threaded ABI (PEP 703)."


def create_test_tar(num_files: int, file_size: int) -> bytes:
    """Create a TAR archive in memory with specified number of files.

    Args:
        num_files: Number of files to include in the archive.
        file_size: Size of each file in bytes.

    Returns:
        TAR archive as bytes.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for i in range(num_files):
            filename = f"file_{i:06d}.txt"
            content = b"1" * file_size
            info = tarfile.TarInfo(name=filename)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    tar_buffer.seek(0)
    return tar_buffer.getvalue()


def run_benchmark(
    configs: list[tuple[str, Callable[[bytes], int]]],
    num_files: int,
    file_sizes: list[int],
    num_iterations: int,
    thread_counts: list[int],
    num_runs: int = 5,
) -> list[BenchmarkResult]:
    """Run benchmark comparing SPDL and built-in implementations.

    Tests both :py:func:`spdl.io.iter_tarfile` (with bytes and file-like inputs)
    and Python's built-in ``tarfile`` module.

    Args:
        configs: List of ``(name, function)`` pairs to benchmark.
        num_files: Number of files in the test TAR archive.
        file_sizes: List of file sizes to test (in bytes).
        num_iterations: Number of iterations for each thread count.
        thread_counts: List of thread counts to test.
        num_runs: Number of runs to perform for confidence interval calculation.
            Defaults to 5.

    Returns:
        List of :py:class:`BenchmarkResult`, one for each configuration tested.
    """

    results: list[BenchmarkResult] = []

    for file_size in file_sizes:
        for func_name, func in configs:
            tar_data = create_test_tar(num_files, file_size)
            _LG.info(
                "TAR size: %s (%d x %s), '%s'",
                _size_str(len(tar_data)),
                num_files,
                _size_str(file_size),
                func_name,
            )

            for num_threads in thread_counts:
                total_count, qps_mean, qps_lower_ci, qps_upper_ci = benchmark(
                    func, tar_data, num_iterations, num_threads, num_runs
                )

                margin = (qps_upper_ci - qps_lower_ci) / 2
                _LG.info(
                    "  Threads: %2d  QPS: %8.2f ± %.2f  (%.2f-%.2f, %d runs, %d files)",
                    num_threads,
                    qps_mean,
                    margin,
                    qps_lower_ci,
                    qps_upper_ci,
                    num_runs,
                    total_count,
                )

                python_version, free_threaded = _get_python_info()
                results.append(
                    BenchmarkResult(
                        function_name=func_name,
                        tar_size=len(tar_data),
                        file_size=file_size,
                        num_files=num_files,
                        num_threads=num_threads,
                        num_iterations=num_iterations,
                        qps_mean=qps_mean,
                        qps_lower_ci=qps_lower_ci,
                        qps_upper_ci=qps_upper_ci,
                        total_files_processed=total_count,
                        python_version=python_version,
                        free_threaded=free_threaded,
                    )
                )

    return results


def save_results_to_csv(
    results: list[BenchmarkResult],
    output_file: str = "benchmark_tarfile_results.csv",
) -> None:
    """Save benchmark results to a CSV file that Excel can open.

    Args:
        results: List of BenchmarkResult objects containing benchmark data.
        output_file: Output file path for the CSV file.
    """
    with open(output_file, "w", newline="") as csvfile:
        fieldnames = [
            "function_name",
            "tar_size",
            "file_size",
            "num_files",
            "num_threads",
            "num_iterations",
            "qps_mean",
            "qps_lower_ci",
            "qps_upper_ci",
            "total_files_processed",
            "python_version",
            "free_threaded",
        ]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for r in results:
            writer.writerow(
                {
                    "function_name": r.function_name,
                    "tar_size": r.tar_size,
                    "file_size": r.file_size,
                    "num_files": r.num_files,
                    "num_threads": r.num_threads,
                    "num_iterations": r.num_iterations,
                    "qps_mean": r.qps_mean,
                    "qps_lower_ci": r.qps_lower_ci,
                    "qps_upper_ci": r.qps_upper_ci,
                    "total_files_processed": r.total_files_processed,
                    "python_version": r.python_version,
                    "free_threaded": r.free_threaded,
                }
            )
    _LG.info("Results saved to %s", output_file)


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Benchmark iter_tarfile performance with multi-threading"
    )
    parser.add_argument(
        "--num-files",
        type=int,
        default=100,
        help="Number of files in the test TAR archive",
    )
    parser.add_argument(
        "--num-iterations",
        type=int,
        default=100,
        help="Number of iterations for each thread count",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="iter_tarfile_benchmark_results.csv",
        help="Output path for the results",
    )
    parser.add_argument(
        "--log-level",
        type=str,
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Logging level",
    )

    return parser.parse_args()


def main() -> None:
    """Main entry point for the benchmark script.

    Parses command-line arguments, runs benchmarks, and saves the results to CSV.
    """

    args = _parse_args()

    logging.basicConfig(
        level=getattr(logging, args.log_level),
        format="%(asctime)s [%(levelname).1s]: %(message)s",
    )

    thread_counts = [1, 4, 8, 16, 32]
    file_sizes = [2**8, 2**12, 2**16, 2**20]

    _LG.info("Starting benchmark with configuration:")
    _LG.info("  Number of files: %d", args.num_files)
    _LG.info("  File sizes: %s bytes", file_sizes)
    _LG.info("  Iterations per thread count: %d", args.num_iterations)
    _LG.info("  Thread counts: %s", thread_counts)

    configs: list[tuple[str, Callable[[bytes], int]]] = [
        ("1. Python tarfile", process_tar_builtin),
        ("2. SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
        (
            "3. SPDL iter_tarfile (bytes w/ convert)",
            partial(process_tar_spdl, convert=True),
        ),
        (
            "4. SPDL iter_tarfile (bytes w/o convert)",
            partial(process_tar_spdl, convert=False),
        ),
    ]

    results = run_benchmark(
        configs,
        num_files=args.num_files,
        file_sizes=file_sizes,
        num_iterations=args.num_iterations,
        thread_counts=thread_counts,
    )

    # Save results to CSV
    save_results_to_csv(results, args.output)

    _LG.info(
        "Benchmark complete. To generate plots, run: "
        "python plot_tar_benchmark.py --input %s --output %s",
        args.output,
        args.output.replace(".csv", ".png"),
    )


if __name__ == "__main__":
    main()

Functions


benchmark(func, tar_data: bytes, num_iterations: int, num_threads: int, num_runs: int = 5) -> tuple[int, float, float, float]

Benchmark function with specified number of threads.

Runs multiple benchmark iterations and calculates 95% confidence intervals.

Parameters:
  • func – Function to benchmark (e.g., process_tar_spdl or process_tar_builtin).

  • tar_data – TAR archive as bytes.

  • num_iterations – Number of iterations to run per benchmark run.

  • num_threads – Number of threads to use.

  • num_runs – Number of benchmark runs to perform for confidence interval calculation. Defaults to 5.

Returns:

Tuple of (total_files_processed, qps_mean, qps_lower_ci, qps_upper_ci).
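For example, a single configuration could be measured like this (a sketch using helpers defined in this script):

tar_data = create_test_tar(num_files=100, file_size=4096)
total, qps, lo, hi = benchmark(
    process_tar_builtin, tar_data, num_iterations=100, num_threads=4
)
print(f"QPS: {qps:.2f} (95% CI: {lo:.2f}-{hi:.2f}), files processed: {total}")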

create_test_tar(num_files: int, file_size: int) -> bytes

Create a TAR archive in memory with specified number of files.

Parameters:
  • num_files – Number of files to include in the archive.

  • file_size – Size of each file in bytes.

Returns:

TAR archive as bytes.
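A quick sanity check of the generated archive with the standard library (illustrative):

import io, tarfile

data = create_test_tar(num_files=3, file_size=16)
with tarfile.open(fileobj=io.BytesIO(data)) as tar:
    print([m.name for m in tar.getmembers()])
    # ['file_000000.txt', 'file_000001.txt', 'file_000002.txt']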

iter_tarfile_builtin(tar_data: bytes) -> Iterator[tuple[str, bytes]]

Iterate over TAR file using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Yields:

Tuple of (filename, content) for each file in the archive.
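Usage sketch, pairing it with create_test_tar from this script:

for name, content in iter_tarfile_builtin(create_test_tar(num_files=2, file_size=8)):
    print(name, len(content))  # file_000000.txt 8, then file_000001.txt 8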

main() -> None

Main entry point for the benchmark script.

Parses command-line arguments, runs benchmarks, and saves the results to CSV.

save_results_to_csv(results: list[BenchmarkResult], output_file: str = 'benchmark_tarfile_results.csv') -> None

Save benchmark results to a CSV file that Excel can open.

Parameters:
  • results – List of BenchmarkResult objects containing benchmark data.

  • output_file – Output file path for the CSV file.

process_tar_builtin(tar_data: bytes) -> int

Process TAR archive using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

process_tar_spdl(tar_data: bytes, convert: bool) -> int

Process TAR archive using spdl.io.iter_tarfile().

Parameters:
  • tar_data – TAR archive as bytes.

  • convert – If True, convert each yielded memory view to bytes.

Returns:

Number of files processed.
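The convert flag decides whether each yielded memory view is materialized as bytes; both paths count the same files (a sketch):

data = create_test_tar(num_files=10, file_size=256)
n_views = process_tar_spdl(data, convert=False)  # iterate only, no copies
n_bytes = process_tar_spdl(data, convert=True)   # bytes(content) copies each entry
assert n_views == n_bytes == 10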

process_tar_spdl_filelike(tar_data: bytes) -> int

Process TAR archive using spdl.io.iter_tarfile() with file-like object.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

run_benchmark(configs: list[tuple[str, Callable[[bytes], int]]], num_files: int, file_sizes: list[int], num_iterations: int, thread_counts: list[int], num_runs: int = 5) -> list[BenchmarkResult]

Run benchmark comparing SPDL and built-in implementations.

Tests both spdl.io.iter_tarfile() (with bytes and file-like inputs) and Python’s built-in tarfile module.

Parameters:
  • configs – List of (name, function) pairs to benchmark.

  • num_files – Number of files in the test TAR archive.

  • file_sizes – List of file sizes to test (in bytes).

  • num_iterations – Number of iterations for each thread count.

  • thread_counts – List of thread counts to test.

  • num_runs – Number of runs to perform for confidence interval calculation. Defaults to 5.

Returns:

List of BenchmarkResult, one for each configuration tested.
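Putting it together, a reduced version of what main() does (a sketch; the names come from this script):

from functools import partial

configs = [
    ("1. Python tarfile", process_tar_builtin),
    ("3. SPDL iter_tarfile (bytes w/ convert)", partial(process_tar_spdl, convert=True)),
]
results = run_benchmark(
    configs,
    num_files=100,
    file_sizes=[4096, 65536],
    num_iterations=50,
    thread_counts=[1, 4],
)
save_results_to_csv(results, "results.csv")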

Classes


class BenchmarkResult(function_name: str, tar_size: int, file_size: int, num_files: int, num_threads: int, num_iterations: int, qps_mean: float, qps_lower_ci: float, qps_upper_ci: float, total_files_processed: int, python_version: str, free_threaded: bool)

Single benchmark result for a specific configuration.

file_size: int

Size of each file in the TAR archive in bytes.

free_threaded: bool

Whether Python is running with free-threaded ABI (PEP 703).

function_name: str

Name of the function being benchmarked.

num_files: int

Number of files in the TAR archive.

num_iterations: int

Number of iterations performed.

num_threads: int

Number of threads used for this benchmark.

python_version: str

Python version used for the benchmark.

qps_lower_ci: float

Lower bound of 95% confidence interval for QPS.

qps_mean: float

Mean queries per second (QPS).

qps_upper_ci: float

Upper bound of 95% confidence interval for QPS.

tar_size: int

Size of the TAR archive in bytes.

total_files_processed: int

Total number of files processed during the benchmark.