Benchmark tarfile¶
Benchmark script for iter_tarfile function.
This script benchmarks the performance of spdl.io.iter_tarfile() against
Python's built-in tarfile module using multi-threading.
Two types of inputs are tested for spdl.io.iter_tarfile():
a byte string and a file-like object that returns byte strings in chunks.
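For illustration, here is a minimal sketch of the two input modes (archive.tar is a placeholder path; the tuple-of-(name, content) iteration follows the usage in the source below):

import io
import spdl.io

with open("archive.tar", "rb") as f:
    data = f.read()

# Input 1: a byte string. Items are yielded as views into the buffer.
for name, content in spdl.io.iter_tarfile(data):
    print(name, len(content))

# Input 2: a file-like object. Data is read incrementally in chunks,
# and each file's content is materialized as a new byte string.
for name, content in spdl.io.iter_tarfile(io.BytesIO(data)):
    print(name, len(content))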
The benchmark:
1. Creates test tar archives with various numbers of files
2. Runs both implementations with different thread counts
3. Measures queries per second (QPS) for each configuration
4. Plots the results comparing the three implementations
Result
The following plots show the QPS (measured by the number of files processed) of each function with different file sizes.

[Plot: QPS vs. number of threads for each configuration, one panel per file size.]

[Plot: the same results with the last configuration omitted, for an easier view.]
The spdl.io.iter_tarfile() function processes data fastest when the input is a byte string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that iter_tarfile() returns a memory view when the input is a byte string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.
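As a concrete sketch of the byte-string path (the tiny archive is built the same way as create_test_tar in the source below):

import io
import tarfile
import spdl.io

# Build a tiny in-memory TAR archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

for name, content in spdl.io.iter_tarfile(buf.getvalue()):
    # With a bytes input, content is a memory view into the input buffer.
    # bytes(content) makes an owned copy; this copy is what the
    # "bytes w/ convert" configuration in the benchmark measures.
    print(name, bytes(content))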
When the input data type is switched from a byte string to a file-like object,
the performance of spdl.io.iter_tarfile() is also affected by the size of the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be held while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to that of Python's built-in tarfile module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
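To see why intermediate byte strings accumulate, consider a file-like object that, like a socket or an unbuffered stream, honors a read request only partially. ChunkedReader below is a hypothetical class for illustration, not part of SPDL:

import io

class ChunkedReader:
    """Returns at most `chunk` bytes per read(), regardless of the request."""

    def __init__(self, data: bytes, chunk: int = 512) -> None:
        self._buf = io.BytesIO(data)
        self._chunk = chunk

    def read(self, size: int = -1) -> bytes:
        # A consumer that needs N > chunk bytes must call read() repeatedly
        # and join the partial results, creating an intermediate byte
        # string on every call.
        if size < 0 or size > self._chunk:
            size = self._chunk
        return self._buf.read(size)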
Source¶
#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# pyre-unsafe

"""Benchmark script for iter_tarfile function.

This script benchmarks the performance of :py:func:`spdl.io.iter_tarfile` against
Python's built-in ``tarfile`` module using multi-threading.
Two types of inputs are tested for :py:func:`spdl.io.iter_tarfile`:
a byte string and a file-like object that returns byte strings in chunks.

The benchmark:

1. Creates test tar archives with various numbers of files
2. Runs both implementations with different thread counts
3. Measures queries per second (QPS) for each configuration
4. Plots the results comparing the three implementations

**Result**

The following plots show the QPS (measured by the number of files processed) of each
function with different file sizes.

.. image:: ../../_static/data/example_benchmark_tarfile.png

.. image:: ../../_static/data/example_benchmark_tarfile_2.png

The :py:func:`spdl.io.iter_tarfile` function processes data fastest when the input is a byte
string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that :py:func:`~spdl.io.iter_tarfile` returns a memory view when the input is a byte
string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.

When the input data type is switched from a byte string to a file-like object,
the performance of :py:func:`spdl.io.iter_tarfile` is also affected by the size of
the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be held while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to Python's built-in ``tarfile`` module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
"""

__all__ = [
    "BenchmarkResult",
    "benchmark",
    "create_test_tar",
    "iter_tarfile_builtin",
    "main",
    "plot_results",
    "process_tar_builtin",
    "process_tar_spdl",
    "process_tar_spdl_filelike",
    "run_benchmark",
]

import argparse
import io
import logging
import os
import tarfile
import time
from collections.abc import Callable, Iterator
from concurrent.futures import as_completed, ThreadPoolExecutor
from dataclasses import dataclass
from functools import partial

import numpy as np
import spdl.io

_LG = logging.getLogger(__name__)


def iter_tarfile_builtin(tar_data: bytes) -> Iterator[tuple[str, bytes]]:
    """Iterate over TAR file using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Yields:
        Tuple of ``(filename, content)`` for each file in the archive.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_data), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    content = file_obj.read()
                    yield member.name, content

def process_tar_spdl(tar_data: bytes, convert: bool) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile`.

    Args:
        tar_data: TAR archive as bytes.
        convert: If ``True``, convert each yielded memory view into an owned
            byte string via ``bytes()``.

    Returns:
        Number of files processed.
    """
    count = 0
    if convert:
        for _, content in spdl.io.iter_tarfile(tar_data):
            bytes(content)
            count += 1
        return count
    else:
        for _ in spdl.io.iter_tarfile(tar_data):
            count += 1
        return count

def process_tar_builtin(tar_data: bytes) -> int:
    """Process TAR archive using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    for _ in iter_tarfile_builtin(tar_data):
        count += 1
    return count


def process_tar_spdl_filelike(tar_data: bytes) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile` with file-like object.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    file_like = io.BytesIO(tar_data)
    for _ in spdl.io.iter_tarfile(file_like):  # pyre-ignore[6]
        count += 1
    return count


def benchmark(
    func,
    tar_data: bytes,
    num_iterations: int,
    num_threads: int,
    num_runs: int = 5,
) -> tuple[int, float, float, float]:
    """Benchmark function with specified number of threads.

    Runs multiple benchmark iterations and calculates 95% confidence intervals.

    Args:
        func: Function to benchmark (e.g., ``process_tar_spdl`` or ``process_tar_builtin``).
        tar_data: TAR archive as bytes.
        num_iterations: Number of iterations to run per benchmark run.
        num_threads: Number of threads to use.
        num_runs: Number of benchmark runs to perform for confidence interval calculation.
            Defaults to 5.

    Returns:
        Tuple of ``(total_files_processed, qps_mean, qps_lower_ci, qps_upper_ci)``.
    """
    qps_samples = []
    last_total_count = 0

    with ThreadPoolExecutor(max_workers=num_threads) as exe:
        # Warm-up phase: run a few iterations to warm up the executor
        warmup_futures = [exe.submit(func, tar_data) for _ in range(10 * num_threads)]
        for future in as_completed(warmup_futures):
            _ = future.result()

        # Run multiple benchmark iterations
        for _ in range(num_runs):
            t0 = time.monotonic()
            futures = [exe.submit(func, tar_data) for _ in range(num_iterations)]
            total_count = 0
            for future in as_completed(futures):
                total_count += future.result()
            elapsed = time.monotonic() - t0

            qps = num_iterations / elapsed
            qps_samples.append(qps)
            last_total_count = total_count

    # Calculate mean and 95% confidence interval
    qps_mean = sum(qps_samples) / len(qps_samples)
    qps_std = np.std(qps_samples, ddof=1)
    # Using t-distribution critical value for 95% CI
    # For small samples (n=5), t-value ≈ 2.776
    t_value = 2.776 if num_runs == 5 else 2.0
    margin = t_value * qps_std / (num_runs**0.5)
    qps_lower_ci = qps_mean - margin
    qps_upper_ci = qps_mean + margin

    return last_total_count, qps_mean, qps_lower_ci, qps_upper_ci

def _size_str(n: int) -> str:
    if n < 1024:
        return f"{n} B"
    if n < 1024 * 1024:
        return f"{n / 1024: .2f} kB"
    if n < 1024 * 1024 * 1024:
        return f"{n / (1024 * 1024): .2f} MB"
    return f"{n / (1024 * 1024 * 1024): .2f} GB"


@dataclass
class BenchmarkResult:
    """Single benchmark result for a specific configuration."""

    function_name: str
    "Name of the function being benchmarked."
    tar_size: int
    "Size of the TAR archive in bytes."
    file_size: int
    "Size of each file in the TAR archive in bytes."
    num_files: int
    "Number of files in the TAR archive."
    num_threads: int
    "Number of threads used for this benchmark."
    num_iterations: int
    "Number of iterations performed."
    qps_mean: float
    "Mean queries per second (QPS)."
    qps_lower_ci: float
    "Lower bound of 95% confidence interval for QPS."
    qps_upper_ci: float
    "Upper bound of 95% confidence interval for QPS."
    total_files_processed: int
    "Total number of files processed during the benchmark."

def create_test_tar(num_files: int, file_size: int) -> bytes:
    """Create a TAR archive in memory with specified number of files.

    Args:
        num_files: Number of files to include in the archive.
        file_size: Size of each file in bytes.

    Returns:
        TAR archive as bytes.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for i in range(num_files):
            filename = f"file_{i:06d}.txt"
            content = b"1" * file_size
            info = tarfile.TarInfo(name=filename)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    tar_buffer.seek(0)
    return tar_buffer.getvalue()

def run_benchmark(
    configs: list[tuple[str, Callable[[bytes], int]]],
    num_files: int,
    file_sizes: list[int],
    num_iterations: int,
    thread_counts: list[int],
    num_runs: int = 5,
) -> list[BenchmarkResult]:
    """Run benchmark comparing SPDL and built-in implementations.

    Tests both :py:func:`spdl.io.iter_tarfile` (with bytes and file-like inputs)
    and Python's built-in ``tarfile`` module.

    Args:
        configs: List of ``(name, function)`` pairs to benchmark.
        num_files: Number of files in the test TAR archive.
        file_sizes: List of file sizes to test (in bytes).
        num_iterations: Number of iterations for each thread count.
        thread_counts: List of thread counts to test.
        num_runs: Number of runs to perform for confidence interval calculation.
            Defaults to 5.

    Returns:
        List of :py:class:`BenchmarkResult`, one for each configuration tested.
    """

    results: list[BenchmarkResult] = []

    for file_size in file_sizes:
        for func_name, func in configs:
            tar_data = create_test_tar(num_files, file_size)
            _LG.info(
                "TAR size: %s (%d x %s), '%s'",
                _size_str(len(tar_data)),
                num_files,
                _size_str(file_size),
                func_name,
            )

            for num_threads in thread_counts:
                total_count, qps_mean, qps_lower_ci, qps_upper_ci = benchmark(
                    func, tar_data, num_iterations, num_threads, num_runs
                )

                margin = (qps_upper_ci - qps_lower_ci) / 2
                _LG.info(
                    " Threads: %2d QPS: %8.2f ± %.2f (%.2f-%.2f, %d runs, %d files)",
                    num_threads,
                    qps_mean,
                    margin,
                    qps_lower_ci,
                    qps_upper_ci,
                    num_runs,
                    total_count,
                )

                results.append(
                    BenchmarkResult(
                        function_name=func_name,
                        tar_size=len(tar_data),
                        file_size=file_size,
                        num_files=num_files,
                        num_threads=num_threads,
                        num_iterations=num_iterations,
                        qps_mean=qps_mean,
                        qps_lower_ci=qps_lower_ci,
                        qps_upper_ci=qps_upper_ci,
                        total_files_processed=total_count,
                    )
                )

    return results

def plot_results(
    results: list[BenchmarkResult],
    output_path: str,
) -> None:
    """Plot benchmark results with 95% confidence intervals and save to file.

    Creates subplots for each file size tested, showing QPS vs. thread count
    with shaded confidence interval regions.

    Args:
        results: List of :py:class:`BenchmarkResult` containing all benchmark data.
        output_path: Path to save the plot (e.g., ``benchmark_results.png``).
    """
    import matplotlib.pyplot as plt

    # Extract unique file sizes and function names
    file_sizes = sorted({r.file_size for r in results})
    function_names = sorted({r.function_name for r in results})

    # Create subplots: at most 3 columns, multiple rows if needed
    num_sizes = len(file_sizes)
    max_cols = 3
    num_cols = min(num_sizes, max_cols)
    num_rows = (num_sizes + max_cols - 1) // max_cols  # Ceiling division

    fig, axes = plt.subplots(num_rows, num_cols, figsize=(6 * num_cols, 5 * num_rows))

    # Normalize axes into a flat list for uniform indexing
    if num_rows == 1 and num_cols == 1:
        axes = [axes]
    else:
        axes = axes.flatten()

    for idx, file_size in enumerate(file_sizes):
        ax = axes[idx]
        first_tar_size = 0
        first_thread_counts = None
        first_num_files = 0

        for func_name in function_names:
            # Filter results for this function and file size
            func_results = [
                r
                for r in results
                if r.function_name == func_name and r.file_size == file_size
            ]

            if not func_results:
                continue

            # Sort by thread count
            func_results.sort(key=lambda r: r.num_threads)

            thread_counts = [r.num_threads for r in func_results]
            qps_means = [r.qps_mean for r in func_results]
            qps_lower_cis = [r.qps_lower_ci for r in func_results]
            qps_upper_cis = [r.qps_upper_ci for r in func_results]

            if first_thread_counts is None:
                first_tar_size = func_results[0].tar_size
                first_thread_counts = thread_counts
                first_num_files = func_results[0].num_files

            ax.plot(
                thread_counts,
                qps_means,
                marker="o",
                label=func_name,
                linewidth=2,
            )

            # Add shaded confidence interval
            ax.fill_between(
                thread_counts,
                qps_lower_cis,
                qps_upper_cis,
                alpha=0.2,
            )

        ax.set_xlabel("Number of Threads", fontsize=11)
        ax.set_ylabel("Queries Per Second (QPS)", fontsize=11)
        ax.set_title(
            f"File Size: {_size_str(first_tar_size)} ({first_num_files} x {_size_str(file_size)})",
            fontsize=12,
        )
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)
        ax.set_ylim([0, None])
        if first_thread_counts:
            ax.set_xticks(first_thread_counts)
            ax.set_xticklabels(first_thread_counts)

    # Hide any unused subplots
    total_subplots = num_rows * num_cols
    for idx in range(num_sizes, total_subplots):
        axes[idx].set_visible(False)

    fig.suptitle(
        "TAR File Parsing Performance: SPDL vs Python tarfile\n(with 95% Confidence Intervals)",
        fontsize=14,
    )

    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    _LG.info("Plot saved to: %s", output_path)

def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Benchmark iter_tarfile performance with multi-threading"
    )
    parser.add_argument(
        "--num-files",
        type=int,
        default=100,
        help="Number of files in the test TAR archive",
    )
    parser.add_argument(
        "--num-iterations",
        type=int,
        default=100,
        help="Number of iterations for each thread count",
    )
    parser.add_argument(
        "--output",
        type=str,
        default="benchmark_tarfile_results.png",
        help="Output path for the plot",
    )
    parser.add_argument(
        "--log-level",
        type=str,
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Logging level",
    )

    return parser.parse_args()


def _suffix(path: str) -> str:
    p1, p2 = os.path.splitext(path)
    return f"{p1}_2{p2}"


def main() -> None:
    """Main entry point for the benchmark script.

    Parses command-line arguments, runs benchmarks, and generates plots.
    """

    args = _parse_args()

    logging.basicConfig(
        level=getattr(logging, args.log_level),
        format="%(asctime)s [%(levelname).1s]: %(message)s",
    )

    thread_counts = [1, 4, 8, 16, 32]
    file_sizes = [2**8, 2**12, 2**16, 2**20]

    _LG.info("Starting benchmark with configuration:")
    _LG.info(" Number of files: %d", args.num_files)
    _LG.info(" File sizes: %s bytes", file_sizes)
    _LG.info(" Iterations per thread count: %d", args.num_iterations)
    _LG.info(" Thread counts: %s", thread_counts)

    configs: list[tuple[str, Callable[[bytes], int]]] = [
        ("1. Python tarfile", process_tar_builtin),
        ("2. SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
        (
            "3. SPDL iter_tarfile (bytes w/ convert)",
            partial(process_tar_spdl, convert=True),
        ),
        (
            "4. SPDL iter_tarfile (bytes w/o convert)",
            partial(process_tar_spdl, convert=False),
        ),
    ]

    results = run_benchmark(
        configs,
        num_files=args.num_files,
        file_sizes=file_sizes,
        num_iterations=args.num_iterations,
        thread_counts=thread_counts,
    )

    plot_results(results, args.output)
    # Plot again without the last configuration for an easier-to-read view
    k = configs[-1][0]
    plot_results(
        [r for r in results if r.function_name != k],
        _suffix(args.output),
    )


if __name__ == "__main__":
    main()
Functions¶
- benchmark(func, tar_data: bytes, num_iterations: int, num_threads: int, num_runs: int = 5) → tuple[int, float, float, float] [source]¶
Benchmark function with specified number of threads.
Runs multiple benchmark iterations and calculates 95% confidence intervals.
- Parameters:
func – Function to benchmark (e.g., process_tar_spdl or process_tar_builtin).
tar_data – TAR archive as bytes.
num_iterations – Number of iterations to run per benchmark run.
num_threads – Number of threads to use.
num_runs – Number of benchmark runs to perform for confidence interval calculation. Defaults to 5.
- Returns:
Tuple of (total_files_processed, qps_mean, qps_lower_ci, qps_upper_ci).
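For example, a single configuration can be measured directly (a sketch using this module's functions; actual numbers depend on the machine). The reported margin is t * s / sqrt(n), with t ≈ 2.776 for n = 5 runs:

tar_data = create_test_tar(num_files=100, file_size=4096)
total, qps, lo, hi = benchmark(
    process_tar_builtin, tar_data, num_iterations=100, num_threads=4
)
print(f"QPS: {qps:.1f} (95% CI: {lo:.1f}-{hi:.1f}, {total} files)")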
- create_test_tar(num_files: int, file_size: int) → bytes [source]¶
Create a TAR archive in memory with specified number of files.
- Parameters:
num_files – Number of files to include in the archive.
file_size – Size of each file in bytes.
- Returns:
TAR archive as bytes.
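A quick sketch of creating and inspecting such an archive (file names follow the file_NNNNNN.txt pattern used by this module):

import io
import tarfile

tar_bytes = create_test_tar(num_files=10, file_size=1024)
with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
    print(tar.getnames()[:2])  # ['file_000000.txt', 'file_000001.txt']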
- iter_tarfile_builtin(tar_data: bytes) → Iterator[tuple[str, bytes]] [source]¶
Iterate over TAR file using Python's built-in tarfile module.
- Parameters:
tar_data – TAR archive as bytes.
- Yields:
Tuple of (filename, content) for each file in the archive.
- main() → None [source]¶
Main entry point for the benchmark script.
Parses command-line arguments, runs benchmarks, and generates plots.
- plot_results(results: list[BenchmarkResult], output_path: str) → None [source]¶
Plot benchmark results with 95% confidence intervals and save to file.
Creates subplots for each file size tested, showing QPS vs. thread count with shaded confidence interval regions.
- Parameters:
results – List of BenchmarkResult containing all benchmark data.
output_path – Path to save the plot (e.g., benchmark_results.png).
- process_tar_builtin(tar_data: bytes) → int [source]¶
Process TAR archive using Python's built-in tarfile module.
- Parameters:
tar_data – TAR archive as bytes.
- Returns:
Number of files processed.
- process_tar_spdl(tar_data: bytes, convert: bool) → int [source]¶
Process TAR archive using spdl.io.iter_tarfile().
- Parameters:
tar_data – TAR archive as bytes.
convert – If True, convert each yielded memory view into an owned byte string via bytes().
- Returns:
Number of files processed.
- process_tar_spdl_filelike(tar_data: bytes) → int [source]¶
Process TAR archive using spdl.io.iter_tarfile() with a file-like object.
- Parameters:
tar_data – TAR archive as bytes.
- Returns:
Number of files processed.
- run_benchmark(configs: list[tuple[str, Callable[[bytes], int]]], num_files: int, file_sizes: list[int], num_iterations: int, thread_counts: list[int], num_runs: int = 5) → list[BenchmarkResult] [source]¶
Run benchmark comparing SPDL and built-in implementations.
Tests both spdl.io.iter_tarfile() (with bytes and file-like inputs) and Python's built-in tarfile module.
- Parameters:
configs – List of (name, function) pairs to benchmark.
num_files – Number of files in the test TAR archive.
file_sizes – List of file sizes to test (in bytes).
num_iterations – Number of iterations for each thread count.
thread_counts – List of thread counts to test.
num_runs – Number of runs to perform for confidence interval calculation. Defaults to 5.
- Returns:
List of BenchmarkResult, one for each configuration tested.
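Putting it together, a reduced sketch of what main() does (the output file name is arbitrary):

configs = [
    ("1. Python tarfile", process_tar_builtin),
    ("2. SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
]
results = run_benchmark(
    configs,
    num_files=100,
    file_sizes=[2**12, 2**16],
    num_iterations=100,
    thread_counts=[1, 4, 8],
)
plot_results(results, "benchmark_results.png")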
Classes¶