Benchmark tarfile¶
Benchmark script for the spdl.io.iter_tarfile() function.
This script benchmarks the performance of iter_tarfile() against
Python’s built-in tarfile module using multi-threading.
Two types of inputs are tested for iter_tarfile():
a byte string, and a file-like object that returns byte strings in chunks.
The benchmark:
Creates test tar archives with various numbers of files
Runs both implementations with different thread counts
Measures queries per second (QPS) for each configuration
Plots the results comparing the implementations
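The multi-threaded QPS measurement in steps 2 and 3 can be sketched with the standard library alone. The actual script delegates this to a `BenchmarkRunner` from the examples' `benchmark_utils`; the `measure_qps` helper below is a simplified stand-in, not the real API:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(func, num_threads: int, num_iterations: int) -> float:
    """Run ``func`` ``num_iterations`` times across ``num_threads`` threads
    and return completed iterations per second (a simplified stand-in for
    the BenchmarkRunner used by the actual script)."""
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        start = time.perf_counter()
        futures = [executor.submit(func) for _ in range(num_iterations)]
        for f in futures:
            f.result()  # propagate any exception raised in a worker thread
        elapsed = time.perf_counter() - start
    return num_iterations / elapsed


# Example: measure a trivial workload with 4 threads.
qps = measure_qps(lambda: sum(range(1000)), num_threads=4, num_iterations=100)
```

With a GIL-bound workload such as pure-Python parsing, this number stops scaling (or degrades) as `num_threads` grows, which is exactly the effect the benchmark measures.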
Example
$ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output results.csv
# Plot results
$ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot.png
# Plot results without load_wav
$ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot_2.png \
--filter '4. SPDL iter_tarfile (bytes w/o convert)'
Result
The following plot shows the QPS (measured by the number of files processed) of each function at different file sizes.
The spdl.io.iter_tarfile() function processes data fastest when the input is a byte
string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that iter_tarfile() returns a memory view when the input is a byte
string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.
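The zero-copy behavior described above can be illustrated with plain Python. This is a sketch of the general memoryview mechanism, not of the iter_tarfile() internals: slicing a memoryview records an offset and length against the original buffer instead of copying bytes, and only an explicit bytes() call forces a copy.

```python
data = b"\x00" * (1 << 20)  # 1 MiB buffer, standing in for a TAR archive in memory

# Slicing a memoryview copies no data; it only records an offset and length.
view = memoryview(data)[512:1024]

assert view.obj is data           # the view still references the original buffer
assert len(view) == 512
copied = bytes(view)              # an explicit bytes() call is what forces a copy
assert copied == data[512:1024]
```

Creating each small memoryview object still requires holding the GIL, which is why this path stops scaling with thread count even though no data is copied.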
When the input data type is switched from a byte string to a file-like object,
the performance of spdl.io.iter_tarfile() is also affected by the size of
the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be locked while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to Python’s built-in tarfile module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
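The short-read behavior described above is easy to reproduce: per the io module documentation, read(n) may return fewer than n bytes. The ChunkedReader class below is a hypothetical file-like object, written for illustration, that caps every read at 16 bytes, so assembling a 100-byte payload requires several intermediate byte string objects:

```python
import io


class ChunkedReader:
    """A file-like object that returns at most 16 bytes per read(),
    mimicking readers that do not honor the requested length.
    (Hypothetical, for illustration.)"""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        if size < 0 or size > 16:
            size = 16
        return self._buf.read(size)


def read_exact(fobj, n: int) -> bytes:
    """Assemble exactly n bytes, creating one intermediate byte string per read."""
    pieces = []
    remaining = n
    while remaining > 0:
        chunk = fobj.read(remaining)  # may return fewer bytes than requested
        if not chunk:
            break
        pieces.append(chunk)
        remaining -= len(chunk)
    return b"".join(pieces)


payload = read_exact(ChunkedReader(b"x" * 100), 100)
assert len(payload) == 100  # assembled from 7 intermediate 16-byte (or smaller) reads
```

Every intermediate byte string here is allocated while holding the GIL, which is the cost the file-like input path pays relative to the byte-string path.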
Source¶
#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# pyre-strict

"""Benchmark script for the :py:func:`spdl.io.iter_tarfile` function.

This script benchmarks the performance of :py:func:`~spdl.io.iter_tarfile` against
Python's built-in :py:mod:`tarfile` module using multi-threading.
Two types of inputs are tested for :py:func:`~spdl.io.iter_tarfile`:
a byte string, and a file-like object that returns byte strings in chunks.

The benchmark:

1. Creates test tar archives with various numbers of files
2. Runs both implementations with different thread counts
3. Measures queries per second (QPS) for each configuration
4. Plots the results comparing the implementations

**Example**

.. code-block:: shell

   $ numactl --membind 0 --cpubind 0 python benchmark_tarfile.py --output results.csv
   # Plot results
   $ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot.png
   # Plot results without load_wav
   $ python benchmark_tarfile_plot.py --input results.csv --output wav_benchmark_plot_2.png \\
       --filter '4. SPDL iter_tarfile (bytes w/o convert)'

**Result**

The following plot shows the QPS (measured by the number of files processed) of each
function at different file sizes.

.. image:: ../../_static/data/example_benchmark_tarfile.png

.. image:: ../../_static/data/example_benchmark_tarfile_2.png

The :py:func:`spdl.io.iter_tarfile` function processes data fastest when the input is a byte
string.
Its performance is consistent across different file sizes.
This is because, when the entire TAR file is loaded into memory as a contiguous array,
the function only needs to read the header and return the address of the corresponding data
(note that :py:func:`~spdl.io.iter_tarfile` returns a memory view when the input is a byte
string).
Since reading the header is very fast, most of the time is spent creating memory view objects
while holding the GIL (Global Interpreter Lock).
As a result, the speed of loading files decreases as more threads are used.

When the input data type is switched from a byte string to a file-like object,
the performance of :py:func:`spdl.io.iter_tarfile` is also affected by the size of
the input data.
This is because data is processed incrementally, and for each file in the TAR archive,
a new byte string object is created.
The implementation tries to request the exact number of bytes needed, but file-like objects
do not guarantee that they return the requested length;
instead, they return at most the requested number of bytes.
Therefore, many intermediate byte string objects must be created.
As the file size grows, it takes longer to process the data.
Since the GIL must be locked while byte strings are created,
performance degrades as more threads are used.
At some point, the performance becomes similar to Python's built-in ``tarfile`` module,
which is a pure-Python implementation and thus holds the GIL almost entirely.
"""

__all__ = [
    "BenchmarkConfig",
    "BenchmarkResult",
    "create_test_tar",
    "iter_tarfile_builtin",
    "main",
    "process_tar_builtin",
    "process_tar_spdl",
    "process_tar_spdl_filelike",
]

import argparse
import io
import os
import tarfile
from collections.abc import Callable, Iterator
from dataclasses import dataclass
from functools import partial

import spdl.io

try:
    from examples.benchmark_utils import (  # pyre-ignore[21]
        BenchmarkResult,
        BenchmarkRunner,
        ExecutorType,
        get_default_result_path,
        save_results_to_csv,
    )
except ImportError:
    from spdl.examples.benchmark_utils import (
        BenchmarkResult,
        BenchmarkRunner,
        ExecutorType,
        get_default_result_path,
        save_results_to_csv,
    )


DEFAULT_RESULT_PATH: str = get_default_result_path(__file__)


@dataclass
class BenchmarkConfig:
    """Configuration for a single TAR benchmark run."""

    function_name: str
    tar_size: int
    file_size: int
    num_files: int
    num_threads: int
    num_iterations: int
    total_files_processed: int


def iter_tarfile_builtin(tar_data: bytes) -> Iterator[tuple[str, bytes]]:
    """Iterate over TAR file using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Yields:
        Tuple of ``(filename, content)`` for each file in the archive.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_data), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    content = file_obj.read()
                    yield member.name, content


def process_tar_spdl(tar_data: bytes, convert: bool) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile`.

    Args:
        tar_data: TAR archive as bytes.
        convert: If ``True``, convert each yielded memory view to ``bytes``.

    Returns:
        Number of files processed.
    """
    count = 0
    if convert:
        for _, content in spdl.io.iter_tarfile(tar_data):
            bytes(content)
            count += 1
        return count
    else:
        for _ in spdl.io.iter_tarfile(tar_data):
            count += 1
        return count


def process_tar_builtin(tar_data: bytes) -> int:
    """Process TAR archive using Python's built-in ``tarfile`` module.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    for _ in iter_tarfile_builtin(tar_data):
        count += 1
    return count


def process_tar_spdl_filelike(tar_data: bytes) -> int:
    """Process TAR archive using :py:func:`spdl.io.iter_tarfile` with file-like object.

    Args:
        tar_data: TAR archive as bytes.

    Returns:
        Number of files processed.
    """
    count = 0
    file_like = io.BytesIO(tar_data)
    for _ in spdl.io.iter_tarfile(file_like):  # pyre-ignore[6]
        count += 1
    return count


def _size_str(n: int) -> str:
    if n < 1024:
        return f"{n} B"
    if n < 1024 * 1024:
        return f"{n / 1024: .2f} kB"
    if n < 1024 * 1024 * 1024:
        return f"{n / (1024 * 1024): .2f} MB"
    return f"{n / (1024 * 1024 * 1024): .2f} GB"


def create_test_tar(num_files: int, file_size: int) -> bytes:
    """Create a TAR archive in memory with specified number of files.

    Args:
        num_files: Number of files to include in the archive.
        file_size: Size of each file in bytes.

    Returns:
        TAR archive as bytes.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for i in range(num_files):
            filename = f"file_{i:06d}.txt"
            content = b"1" * file_size
            info = tarfile.TarInfo(name=filename)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    tar_buffer.seek(0)
    return tar_buffer.getvalue()


def _parse_args() -> argparse.Namespace:
    """Parse command line arguments.

    Returns:
        Parsed arguments.
    """
    parser = argparse.ArgumentParser(
        description="Benchmark iter_tarfile performance with multi-threading"
    )
    parser.add_argument(
        "--num-files",
        type=int,
        default=100,
        help="Number of files in the test TAR archive",
    )
    parser.add_argument(
        "--num-iterations",
        type=int,
        default=100,
        help="Number of iterations for each thread count",
    )
    parser.add_argument(
        "--output",
        type=lambda p: os.path.realpath(p),
        default=DEFAULT_RESULT_PATH,
        help="Output path for the results",
    )

    return parser.parse_args()


def main() -> None:
    """Main entry point for the benchmark script.

    Parses command-line arguments, runs benchmarks, and generates plots.
    """

    args = _parse_args()

    # Define explicit configuration lists
    thread_counts = [1, 4, 8, 16, 32]
    file_sizes = [2**8, 2**12, 2**16, 2**20]

    # Define benchmark function configurations
    # (function_name, function)
    benchmark_functions: list[tuple[str, Callable[[bytes], int]]] = [
        ("1. Python tarfile", process_tar_builtin),
        ("2. SPDL iter_tarfile (file-like)", process_tar_spdl_filelike),
        (
            "3. SPDL iter_tarfile (bytes w/ convert)",
            partial(process_tar_spdl, convert=True),
        ),
        (
            "4. SPDL iter_tarfile (bytes w/o convert)",
            partial(process_tar_spdl, convert=False),
        ),
    ]

    print("Starting benchmark with configuration:")
    print(f"  Number of files: {args.num_files}")
    print(f"  File sizes: {file_sizes} bytes")
    print(f"  Iterations per thread count: {args.num_iterations}")
    print(f"  Thread counts: {thread_counts}")

    results: list[BenchmarkResult[BenchmarkConfig]] = []
    num_runs = 5

    for num_threads in thread_counts:
        with BenchmarkRunner(
            executor_type=ExecutorType.THREAD,
            num_workers=num_threads,
            warmup_iterations=10 * num_threads,
        ) as runner:
            for file_size in file_sizes:
                tar_data = create_test_tar(args.num_files, file_size)
                for func_name, func in benchmark_functions:
                    print(
                        f"TAR size: {_size_str(len(tar_data))} "
                        f"({args.num_files} x {_size_str(file_size)}), "
                        f"'{func_name}', {num_threads} threads"
                    )

                    total_files_processed = args.num_files * args.num_iterations

                    config = BenchmarkConfig(
                        function_name=func_name,
                        tar_size=len(tar_data),
                        file_size=file_size,
                        num_files=args.num_files,
                        num_threads=num_threads,
                        num_iterations=args.num_iterations,
                        total_files_processed=total_files_processed,
                    )

                    result, _ = runner.run(
                        config,
                        partial(func, tar_data),
                        args.num_iterations,
                        num_runs=num_runs,
                    )

                    margin = (result.ci_upper - result.ci_lower) / 2
                    print(
                        f"  QPS: {result.qps:8.2f} ± {margin:.2f} "
                        f"({result.ci_lower:.2f}-{result.ci_upper:.2f}, "
                        f"{num_runs} runs, {total_files_processed} files)"
                    )

                    results.append(result)

    # Save results to CSV
    save_results_to_csv(results, args.output)

    print(
        f"Benchmark complete. To generate plots, run: "
        f"python benchmark_tarfile_plot.py --input {args.output} "
        f"--output {args.output.replace('.csv', '.png')}"
    )


if __name__ == "__main__":
    main()
Functions¶
- create_test_tar(num_files: int, file_size: int) → bytes[source]¶
Create a TAR archive in memory with specified number of files.
- Parameters:
num_files – Number of files to include in the archive.
file_size – Size of each file in bytes.
- Returns:
TAR archive as bytes.
- iter_tarfile_builtin(tar_data: bytes) → Iterator[tuple[str, bytes]][source]¶
Iterate over TAR file using Python’s built-in tarfile module.
- Parameters:
tar_data – TAR archive as bytes.
- Yields:
Tuple of (filename, content) for each file in the archive.
- main() → None[source]¶
Main entry point for the benchmark script.
Parses command-line arguments, runs benchmarks, and generates plots.
- process_tar_builtin(tar_data: bytes) → int[source]¶
Process TAR archive using Python’s built-in tarfile module.
- Parameters:
tar_data – TAR archive as bytes.
- Returns:
Number of files processed.
- process_tar_spdl(tar_data: bytes, convert: bool) → int[source]¶
Process TAR archive using spdl.io.iter_tarfile().
- Parameters:
tar_data – TAR archive as bytes.
convert – If True, convert each yielded memory view to bytes.
- Returns:
Number of files processed.
- process_tar_spdl_filelike(tar_data: bytes) → int[source]¶
Process TAR archive using spdl.io.iter_tarfile() with file-like object.
- Parameters:
tar_data – TAR archive as bytes.
- Returns:
Number of files processed.
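For reference, the pure-Python pair above can be exercised end to end. The sketch below reproduces the create_test_tar and iter_tarfile_builtin logic from the source so it runs standalone with the standard library only (the SPDL variants require spdl to be installed and are omitted):

```python
import io
import tarfile


def create_test_tar(num_files: int, file_size: int) -> bytes:
    # Same logic as the source above: num_files identical files of file_size bytes.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(num_files):
            content = b"1" * file_size
            info = tarfile.TarInfo(name=f"file_{i:06d}.txt")
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def iter_tarfile_builtin(tar_data: bytes):
    # Same logic as the source above: yield (name, content) for each regular file.
    with tarfile.open(fileobj=io.BytesIO(tar_data), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                f = tar.extractfile(member)
                if f:
                    yield member.name, f.read()


tar_data = create_test_tar(num_files=3, file_size=256)
names = [name for name, _ in iter_tarfile_builtin(tar_data)]
assert names == ["file_000000.txt", "file_000001.txt", "file_000002.txt"]
```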
Classes¶
- class BenchmarkConfig(function_name: str, tar_size: int, file_size: int, num_files: int, num_threads: int, num_iterations: int, total_files_processed: int)[source]¶
Configuration for a single TAR benchmark run.
- class BenchmarkResult[source]¶
Generic benchmark result containing configuration and performance metrics.
This class holds both the benchmark-specific configuration and the common performance statistics. It is parameterized by the config type, which allows each benchmark script to define its own configuration dataclass.
- config: ConfigT¶
Benchmark-specific configuration (e.g., data format, file size, etc.)