Benchmark tarfile

Benchmark script for the iter_tarfile function.

This script benchmarks the performance of spdl.io.iter_tarfile() against Python’s built-in tarfile module using multi-threading. Two types of inputs are tested for spdl.io.iter_tarfile(): a byte string, and a file-like object that returns byte strings in chunks.

The benchmark:

  1. Creates test tar archives with various numbers of files

  2. Runs both implementations with different thread counts

  3. Measures queries per second (QPS) for each configuration

  4. Plots the results comparing the three implementations

Result

The following plots show the QPS (measured by the number of files processed) of each function for different file sizes.

[Plot: example_benchmark_tarfile.png — QPS vs. thread count for each implementation, one panel per file size]
[Plot: example_benchmark_tarfile_2.png — the same comparison with the fastest configuration omitted for readability]

The spdl.io.iter_tarfile() function processes data fastest when the input is a byte string. Its performance is consistent across different file sizes. This is because, when the entire TAR file is loaded into memory as a contiguous array, the function only needs to read the header and return the address of the corresponding data (note that iter_tarfile() returns a memory view when the input is a byte string). Since reading the header is very fast, most of the time is spent creating memory view objects while holding the GIL (Global Interpreter Lock). As a result, the speed of loading files decreases as more threads are used.

When the input data type is switched from a byte string to a file-like object, the performance of spdl.io.iter_tarfile() is also affected by the size of the input data. This is because data is processed incrementally, and for each file in the TAR archive, a new byte string object is created. The implementation tries to request the exact number of bytes needed, but file-like objects do not guarantee that they return the requested length; instead, they return at most the requested number of bytes. Therefore, many intermediate byte string objects must be created. As the file size grows, it takes longer to process the data. Since the GIL must be locked while byte strings are created, performance degrades as more threads are used. At some point, the performance becomes similar to Python’s built-in tarfile module, which is a pure-Python implementation and thus holds the GIL almost entirely.
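
The difference between the two input modes can be seen in how they are invoked. The following minimal sketch (a tiny archive is built with the standard tarfile module purely for illustration) iterates the same archive once as a byte string and once as a file-like object; with bytes input, each yielded item references the original buffer, so an explicit bytes() call is needed if an owned copy is desired.

    import io
    import tarfile

    import spdl.io

    # Build a tiny TAR archive in memory for illustration.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    tar_data = buf.getvalue()

    # Bytes input: each item is a (name, memory view) pair referencing
    # the original buffer; call bytes() to make an owned copy.
    for name, content in spdl.io.iter_tarfile(tar_data):
        print(name, bytes(content))

    # File-like input: data is read incrementally in chunks, and each
    # yielded content is a newly created byte string.
    for name, content in spdl.io.iter_tarfile(io.BytesIO(tar_data)):
        print(name, content)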

Source

#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# pyre-unsafe

"""Benchmark script for the iter_tarfile function.

This script benchmarks the performance of :py:func:`spdl.io.iter_tarfile` against
Python's built-in ``tarfile`` module using multi-threading.
Two types of inputs are tested for :py:func:`spdl.io.iter_tarfile`:
a byte string, and a file-like object that returns byte strings in chunks.

The benchmark:

1. Creates test tar archives with various numbers of files
2. Runs both implementations with different thread counts
3. Measures queries per second (QPS) for each configuration
4. Plots the results comparing the three implementations

**Result**

The following plots show the QPS (measured by the number of files processed) of each
function with different file sizes.

.. image:: ../../_static/data/example_benchmark_tarfile.png

.. image:: ../../_static/data/example_benchmark_tarfile_2.png

The :py:func:`spdl.io.iter_tarfile` function processes data fastest when the input is a byte
string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that :py:func:`~spdl.io.iter_tarfile` returns a memory view when the input is a byte
string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.

When the input data type is switched from a byte string to a file-like object,
the performance of :py:func:`spdl.io.iter_tarfile` is also affected by the size of
the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be locked while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to Python's built-in ``tarfile`` module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
"""

__all__ = [
    "BenchmarkResult",
    "benchmark",
    "create_test_tar",
    "iter_tarfile_builtin",
    "main",
    "plot_results",
    "process_tar_builtin",
    "process_tar_spdl",
    "process_tar_spdl_filelike",
    "run_benchmark",
]

import argparse
import io
import logging
import os
import tarfile
import time
from collections.abc import Callable, Iterator
from concurrent.futures import as_completed, ThreadPoolExecutor
from dataclasses import dataclass
from functools import partial

import numpy as np
import spdl.io

_LG = logging.getLogger(__name__)


def iter_tarfile_builtin(tar_data: bytes) -> Iterator[tuple[str, bytes]]:
    """Iterate over TAR file using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Yields:
        Tuple of ``(filename, content)`` for each file in the archive.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_data), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    content = file_obj.read()
                    yield member.name, content


def process_tar_spdl(tar_data: bytes, convert: bool) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile`.

    Args:
        tar_data: TAR archive as bytes.
        convert: If ``True``, convert each yielded memory view to ``bytes``.

    Returns:
        Number of files processed.
    """
    count = 0
    if convert:
        for _, content in spdl.io.iter_tarfile(tar_data):
            bytes(content)
            count += 1
        return count
    else:
        for _ in spdl.io.iter_tarfile(tar_data):
            count += 1
        return count


def process_tar_builtin(tar_data: bytes) -> int:
    """Process TAR archive using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    for _ in iter_tarfile_builtin(tar_data):
        count += 1
    return count


def process_tar_spdl_filelike(tar_data: bytes) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile` with file-like object.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    file_like = io.BytesIO(tar_data)
    for _ in spdl.io.iter_tarfile(file_like):  # pyre-ignore[6]
        count += 1
    return count


def benchmark(
    func,
    tar_data: bytes,
    num_iterations: int,
    num_threads: int,
    num_runs: int = 5,
) -> tuple[int, float, float, float]:
    """Benchmark function with specified number of threads.

    Runs multiple benchmark iterations and calculates 95% confidence intervals.

    Args:
        func: Function to benchmark (e.g., ``process_tar_spdl`` or ``process_tar_builtin``).
        tar_data: TAR archive as bytes.
        num_iterations: Number of iterations to run per benchmark run.
        num_threads: Number of threads to use.
        num_runs: Number of benchmark runs to perform for confidence interval calculation.
            Defaults to 5.

    Returns:
        Tuple of ``(total_files_processed, qps_mean, qps_lower_ci, qps_upper_ci)``.
    """
    qps_samples = []
    last_total_count = 0

    with ThreadPoolExecutor(max_workers=num_threads) as exe:
        # Warm-up phase: run a few iterations to warm up the executor
        warmup_futures = [exe.submit(func, tar_data) for _ in range(10 * num_threads)]
        for future in as_completed(warmup_futures):
            _ = future.result()

        # Run multiple benchmark iterations
        for _ in range(num_runs):
            t0 = time.monotonic()
            futures = [exe.submit(func, tar_data) for _ in range(num_iterations)]
            total_count = 0
            for future in as_completed(futures):
                total_count += future.result()
            elapsed = time.monotonic() - t0

            qps = num_iterations / elapsed
            qps_samples.append(qps)
            last_total_count = total_count

    # Calculate mean and 95% confidence interval
    qps_mean = sum(qps_samples) / len(qps_samples)
    qps_std = np.std(qps_samples, ddof=1)
    # Using t-distribution critical value for 95% CI
    # For small samples (n=5), t-value ≈ 2.776
    t_value = 2.776 if num_runs == 5 else 2.0
    margin = t_value * qps_std / (num_runs**0.5)
    qps_lower_ci = qps_mean - margin
    qps_upper_ci = qps_mean + margin

    return last_total_count, qps_mean, qps_lower_ci, qps_upper_ci


def _size_str(n: int) -> str:
    if n < 1024:
        return f"{n} B"
    if n < 1024 * 1024:
        return f"{n / 1024: .2f} kB"
    if n < 1024 * 1024 * 1024:
        return f"{n / (1024 * 1024): .2f} MB"
    return f"{n / (1024 * 1024 * 1024): .2f} GB"


@dataclass
class BenchmarkResult:
    """Single benchmark result for a specific configuration."""

    function_name: str
    "Name of the function being benchmarked."
    tar_size: int
    "Size of the TAR archive in bytes."
    file_size: int
    "Size of each file in the TAR archive in bytes."
    num_files: int
    "Number of files in the TAR archive."
    num_threads: int
    "Number of threads used for this benchmark."
    num_iterations: int
    "Number of iterations performed."
    qps_mean: float
    "Mean queries per second (QPS)."
    qps_lower_ci: float
    "Lower bound of 95% confidence interval for QPS."
    qps_upper_ci: float
    "Upper bound of 95% confidence interval for QPS."
    total_files_processed: int
    "Total number of files processed during the benchmark."


def create_test_tar(num_files: int, file_size: int) -> bytes:
    """Create a TAR archive in memory with specified number of files.

    Args:
        num_files: Number of files to include in the archive.
        file_size: Size of each file in bytes.

    Returns:
        TAR archive as bytes.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for i in range(num_files):
            filename = f"file_{i:06d}.txt"
            content = b"1" * file_size
            info = tarfile.TarInfo(name=filename)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    tar_buffer.seek(0)
    return tar_buffer.getvalue()


def run_benchmark(
    configs: list[tuple[str, Callable[[bytes], int]]],
    num_files: int,
    file_sizes: list[int],
    num_iterations: int,
    thread_counts: list[int],
    num_runs: int = 5,
) -> list[BenchmarkResult]:
    """Run benchmark comparing SPDL and built-in implementations.

    Tests both :py:func:`spdl.io.iter_tarfile` (with bytes and file-like inputs)
    and Python's built-in ``tarfile`` module.

    Args:
        configs: List of ``(name, function)`` pairs to benchmark. Each function takes
            the TAR archive as bytes and returns the number of files processed.
        num_files: Number of files in the test TAR archive.
        file_sizes: List of file sizes to test (in bytes).
        num_iterations: Number of iterations for each thread count.
        thread_counts: List of thread counts to test.
        num_runs: Number of runs to perform for confidence interval calculation.
            Defaults to 5.

    Returns:
        List of :py:class:`BenchmarkResult`, one for each configuration tested.
    """

    results: list[BenchmarkResult] = []

    for file_size in file_sizes:
        for func_name, func in configs:
            tar_data = create_test_tar(num_files, file_size)
            _LG.info(
                "TAR size: %s (%d x %s), '%s'",
                _size_str(len(tar_data)),
                num_files,
                _size_str(file_size),
                func_name,
            )

            for num_threads in thread_counts:
                total_count, qps_mean, qps_lower_ci, qps_upper_ci = benchmark(
                    func, tar_data, num_iterations, num_threads, num_runs
                )

                margin = (qps_upper_ci - qps_lower_ci) / 2
                _LG.info(
                    "  Threads: %2d  QPS: %8.2f ± %.2f  (%.2f-%.2f, %d runs, %d files)",
                    num_threads,
                    qps_mean,
                    margin,
                    qps_lower_ci,
                    qps_upper_ci,
                    num_runs,
                    total_count,
                )

                results.append(
                    BenchmarkResult(
                        function_name=func_name,
                        tar_size=len(tar_data),
                        file_size=file_size,
                        num_files=num_files,
                        num_threads=num_threads,
                        num_iterations=num_iterations,
                        qps_mean=qps_mean,
                        qps_lower_ci=qps_lower_ci,
                        qps_upper_ci=qps_upper_ci,
                        total_files_processed=total_count,
                    )
                )

    return results


def plot_results(
    results: list[BenchmarkResult],
    output_path: str,
) -> None:
    """Plot benchmark results with 95% confidence intervals and save to file.

    Creates subplots for each file size tested, showing QPS vs. thread count
    with shaded confidence interval regions.

    Args:
        results: List of :py:class:`BenchmarkResult` containing all benchmark data.
        output_path: Path to save the plot (e.g., ``benchmark_results.png``).
    """
    import matplotlib.pyplot as plt

    # Extract unique file sizes and function names
    file_sizes = sorted({r.file_size for r in results})
    function_names = sorted({r.function_name for r in results})

    # Create subplots: at most 3 columns, multiple rows if needed
    num_sizes = len(file_sizes)
    max_cols = 3
    num_cols = min(num_sizes, max_cols)
    num_rows = (num_sizes + max_cols - 1) // max_cols  # Ceiling division

    fig, axes = plt.subplots(num_rows, num_cols, figsize=(6 * num_cols, 5 * num_rows))

    # Flatten axes array for easier indexing
    if num_rows == 1 and num_cols == 1:
        axes = [axes]
    elif num_rows == 1 or num_cols == 1:
        axes = axes.flatten()
    else:
        axes = axes.flatten()

    for idx, file_size in enumerate(file_sizes):
        ax = axes[idx]
        first_tar_size = 0
        first_thread_counts = None
        first_num_files = 0

        for func_name in function_names:
            # Filter results for this function and file size
            func_results = [
                r
                for r in results
                if r.function_name == func_name and r.file_size == file_size
            ]

            if not func_results:
                continue

            # Sort by thread count
            func_results.sort(key=lambda r: r.num_threads)

            thread_counts = [r.num_threads for r in func_results]
            qps_means = [r.qps_mean for r in func_results]
            qps_lower_cis = [r.qps_lower_ci for r in func_results]
            qps_upper_cis = [r.qps_upper_ci for r in func_results]

            if first_thread_counts is None:
                first_tar_size = func_results[0].tar_size
                first_thread_counts = thread_counts
                first_num_files = func_results[0].num_files

            ax.plot(
                thread_counts,
                qps_means,
                marker="o",
                label=func_name,
                linewidth=2,
            )

            # Add shaded confidence interval
            ax.fill_between(
                thread_counts,
                qps_lower_cis,
                qps_upper_cis,
                alpha=0.2,
            )

        ax.set_xlabel("Number of Threads", fontsize=11)
        ax.set_ylabel("Queries Per Second (QPS)", fontsize=11)
        ax.set_title(
            f"File Size: {_size_str(first_tar_size)} ({first_num_files} x {_size_str(file_size)})",
            fontsize=12,
        )
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)
        ax.set_ylim([0, None])
        if first_thread_counts:
            ax.set_xticks(first_thread_counts)
            ax.set_xticklabels(first_thread_counts)

    # Hide any unused subplots
    total_subplots = num_rows * num_cols
    for idx in range(num_sizes, total_subplots):
        axes[idx].set_visible(False)

    fig.suptitle(
        "TAR File Parsing Performance: SPDL vs Python tarfile\n(with 95% Confidence Intervals)",
        fontsize=14,
    )

    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    _LG.info("Plot saved to: %s", output_path)


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Benchmark iter_tarfile performance with multi-threading"
    )
    parser.add_argument(
        "--num-files",
        type=int,
        default=100,
        help="Number of files in the test TAR archive",
    )
    parser.add_argument(
        "--num-iterations",
        type=int,
        default=100,
        help="Number of iterations for each thread count",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="benchmark_tarfile_results.png",
        help="Output path for the plot",
    )
    parser.add_argument(
        "--log-level",
        type=str,
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Logging level",
    )

    return parser.parse_args()


def _suffix(path: str) -> str:
    p1, p2 = os.path.splitext(path)
    return f"{p1}_2{p2}"


def main() -> None:
    """Main entry point for the benchmark script.

    Parses command-line arguments, runs benchmarks, and generates plots.
    """

    args = _parse_args()

    logging.basicConfig(
        level=getattr(logging, args.log_level),
        format="%(asctime)s [%(levelname).1s]: %(message)s",
    )

    thread_counts = [1, 4, 8, 16, 32]
    file_sizes = [2**8, 2**12, 2**16, 2**20]

    _LG.info("Starting benchmark with configuration:")
    _LG.info("  Number of files: %d", args.num_files)
    _LG.info("  File sizes: %s bytes", file_sizes)
    _LG.info("  Iterations per thread count: %d", args.num_iterations)
    _LG.info("  Thread counts: %s", thread_counts)

    configs: list[tuple[str, Callable[[bytes], int]]] = [
        ("1. Python tarfile", process_tar_builtin),
        ("2. SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
        (
            "3. SPDL iter_tarfile (bytes w/ convert)",
            partial(process_tar_spdl, convert=True),
        ),
        (
            "4. SPDL iter_tarfile (bytes w/o convert)",
            partial(process_tar_spdl, convert=False),
        ),
    ]

    results = run_benchmark(
        configs,
        num_files=args.num_files,
        file_sizes=file_sizes,
        num_iterations=args.num_iterations,
        thread_counts=thread_counts,
    )

    plot_results(results, args.output)
    # Plot again without the last (fastest) configuration for easier comparison
    k = configs[-1][0]
    plot_results(
        [r for r in results if r.function_name != k],
        _suffix(args.output),
    )


if __name__ == "__main__":
    main()

Functions

benchmark(func, tar_data: bytes, num_iterations: int, num_threads: int, num_runs: int = 5) → tuple[int, float, float, float]

Benchmark function with specified number of threads.

Runs multiple benchmark iterations and calculates 95% confidence intervals.

Parameters:
  • func – Function to benchmark (e.g., process_tar_spdl or process_tar_builtin).

  • tar_data – TAR archive as bytes.

  • num_iterations – Number of iterations to run per benchmark run.

  • num_threads – Number of threads to use.

  • num_runs – Number of benchmark runs to perform for confidence interval calculation. Defaults to 5.

Returns:

Tuple of (total_files_processed, qps_mean, qps_lower_ci, qps_upper_ci).
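
For instance, a single configuration could be measured directly as in the following minimal sketch; the archive is first created with create_test_tar(), and the parameter values are illustrative only.

    # Measure the built-in tarfile implementation with 4 worker threads.
    tar_data = create_test_tar(num_files=100, file_size=2**12)
    total, qps, lo, hi = benchmark(
        process_tar_builtin, tar_data, num_iterations=100, num_threads=4
    )
    print(f"QPS: {qps:.1f} (95% CI {lo:.1f}-{hi:.1f}), files processed: {total}")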

create_test_tar(num_files: int, file_size: int) → bytes

Create a TAR archive in memory with specified number of files.

Parameters:
  • num_files – Number of files to include in the archive.

  • file_size – Size of each file in bytes.

Returns:

TAR archive as bytes.

iter_tarfile_builtin(tar_data: bytes) → Iterator[tuple[str, bytes]]

Iterate over TAR file using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Yields:

Tuple of (filename, content) for each file in the archive.

main() → None

Main entry point for the benchmark script.

Parses command-line arguments, runs benchmarks, and generates plots.

plot_results(results: list[BenchmarkResult], output_path: str) → None

Plot benchmark results with 95% confidence intervals and save to file.

Creates subplots for each file size tested, showing QPS vs. thread count with shaded confidence interval regions.

Parameters:
  • results – List of BenchmarkResult containing all benchmark data.

  • output_path – Path to save the plot (e.g., benchmark_results.png).

process_tar_builtin(tar_data: bytes) → int

Process TAR archive using Python’s built-in tarfile module.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

process_tar_spdl(tar_data: bytes, convert: bool) → int

Process TAR archive using spdl.io.iter_tarfile().

Parameters:
  • tar_data – TAR archive as bytes.

  • convert – If True, convert each yielded memory view to bytes.

Returns:

Number of files processed.

process_tar_spdl_filelike(tar_data: bytes) → int

Process TAR archive using spdl.io.iter_tarfile() with file-like object.

Parameters:

tar_data – TAR archive as bytes.

Returns:

Number of files processed.

run_benchmark(configs: list[tuple[str, Callable[[bytes], int]]], num_files: int, file_sizes: list[int], num_iterations: int, thread_counts: list[int], num_runs: int = 5) → list[BenchmarkResult]

Run benchmark comparing SPDL and built-in implementations.

Tests both spdl.io.iter_tarfile() (with bytes and file-like inputs) and Python’s built-in tarfile module.

Parameters:
  • configs – List of (name, function) pairs to benchmark; each function takes the TAR archive as bytes and returns the number of files processed.

  • num_files – Number of files in the test TAR archive.

  • file_sizes – List of file sizes to test (in bytes).

  • num_iterations – Number of iterations for each thread count.

  • thread_counts – List of thread counts to test.

  • num_runs – Number of runs to perform for confidence interval calculation. Defaults to 5.

Returns:

List of BenchmarkResult, one for each configuration tested.
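
For example, a configs list pairing display names with processing callables could look like the following sketch, which mirrors the configurations used in main(); functools.partial is used to fix the convert flag of process_tar_spdl, and the sizes and thread counts are illustrative only.

    from functools import partial

    configs = [
        ("Python tarfile", process_tar_builtin),
        ("SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
        ("SPDL iter_tarfile (bytes)", partial(process_tar_spdl, convert=False)),
    ]
    results = run_benchmark(
        configs,
        num_files=100,
        file_sizes=[2**12, 2**20],
        num_iterations=100,
        thread_counts=[1, 4, 8],
    )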

Classes

class BenchmarkResult(function_name: str, tar_size: int, file_size: int, num_files: int, num_threads: int, num_iterations: int, qps_mean: float, qps_lower_ci: float, qps_upper_ci: float, total_files_processed: int)

Single benchmark result for a specific configuration.

file_size: int

Size of each file in the TAR archive in bytes.

function_name: str

Name of the function being benchmarked.

num_files: int

Number of files in the TAR archive.

num_iterations: int

Number of iterations performed.

num_threads: int

Number of threads used for this benchmark.

qps_lower_ci: float

Lower bound of 95% confidence interval for QPS.

qps_mean: float

Mean queries per second (QPS).

qps_upper_ci: float

Upper bound of 95% confidence interval for QPS.

tar_size: int

Size of the TAR archive in bytes.

total_files_processed: int

Total number of files processed during the benchmark.