imagenet_classification

Benchmark the performance of loading images from local file systems and classifying them using a GPU.

This script builds the data loading pipeline and instantiates an image classification model on a GPU. The pipeline transfers the batched image data to the GPU concurrently, and the foreground thread runs the model on the batches one by one.

flowchart LR
    subgraph MP [Main Process]
        subgraph BG [Background Thread]
            A[Source]
            subgraph TP1[Thread Pool]
                direction LR
                T1[Thread]
                T2[Thread]
                T3[Thread]
            end
        end
        subgraph FG [Main Thread]
            ML[Main loop]
        end
    end
    subgraph G[GPU]
        direction TB
        GM[Memory]
        T[Transform]
        M[Model]
    end
    A --> T1 -- Batch --> GM
    A --> T2 -- Batch --> GM
    A --> T3 -- Batch --> GM
    ML -.-> GM
    GM -.-> T -.-> M

A file list can be created, for example, by:

cd /data/users/moto/imagenet/
find val -name '*.JPEG' > ~/imagenet.val.flist
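
The same kind of file list can also be produced from Python. The following is a minimal sketch, assuming the images live under a subdirectory of the dataset root and use the `.JPEG` suffix as in the `find` command above. `write_flist` is a hypothetical helper for illustration and is not part of the script.

```python
from pathlib import Path


def write_flist(root: Path, subdir: str, out_path: Path, pattern: str = "*.JPEG") -> int:
    """Write the paths of matching files under ``root/subdir``, relative to ``root``.

    Returns the number of paths written.
    """
    # Sort for a reproducible order; `find` makes no ordering guarantee.
    paths = sorted(
        p.relative_to(root).as_posix()
        for p in (root / subdir).rglob(pattern)
        if p.is_file()
    )
    # One path per line, matching the format the benchmark script reads.
    out_path.write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

Because the paths are written relative to the dataset root, the root itself is what would be passed to the script as `--prefix`.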

To run the benchmark, pass the file list to the script as follows:

python imagenet_classification.py \
    --input-flist ~/imagenet.val.flist \
    --prefix /data/users/moto/imagenet/

Source

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Benchmark the performance of loading images from local file systems and
classifying them using a GPU.

This script builds the data loading pipeline and instantiates an image
classification model on a GPU.
The pipeline transfers the batched image data to the GPU concurrently, and
the foreground thread runs the model on the batches one by one.

.. include:: ../plots/imagenet_classification_chart.txt

A file list can be created, for example, by:

.. code-block:: bash

   cd /data/users/moto/imagenet/
   find val -name '*.JPEG' > ~/imagenet.val.flist

To run the benchmark, pass the file list to the script as follows:

.. code-block::

   python imagenet_classification.py \\
       --input-flist ~/imagenet.val.flist \\
       --prefix /data/users/moto/imagenet/
"""

# pyre-ignore-all-errors

import contextlib
import logging
import os.path
import re
import time
from collections.abc import Awaitable, Callable, Iterator
from pathlib import Path

import spdl.io
import spdl.utils
import torch
from spdl.dataloader import Pipeline, PipelineBuilder
from torch import Tensor
from torch.profiler import profile

_LG = logging.getLogger(__name__)


__all__ = [
    "entrypoint",
    "benchmark",
    "source",
    "get_decode_func",
    "get_pipeline",
    "get_model",
    "ModelBundle",
    "Classification",
    "Preprocessing",
    "get_mappings",
    "parse_wnid",
]


def _parse_args(args):
    import argparse

    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--input-flist", type=Path, required=True)
    parser.add_argument("--max-samples", type=int, default=float("inf"))
    parser.add_argument("--prefix", default="")
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--trace", type=Path)
    parser.add_argument("--queue-size", type=int, default=16)
    parser.add_argument("--num-threads", type=int, default=16)
    parser.add_argument("--no-compile", action="store_false", dest="compile")
    parser.add_argument("--no-bf16", action="store_false", dest="use_bf16")
    parser.add_argument("--use-nvdec", action="store_true")
    parser.add_argument("--use-nvjpeg", action="store_true")
    args = parser.parse_args(args)
    if args.trace:
        # Keep the trace small: limit the run to 60 batches.
        args.max_samples = args.batch_size * 60
    return args


# Handroll the transforms so as to support `torch.compile`
class Preprocessing(torch.nn.Module):
    """Perform pixel normalization and data type conversion.

    Args:
        mean: The mean value of the dataset.
        std: The standard deviation of the dataset.
    """

    def __init__(self, mean: Tensor, std: Tensor) -> None:
        super().__init__()
        self.register_buffer("mean", mean)
        self.register_buffer("std", std)

    def forward(self, x: Tensor) -> Tensor:
        """Normalize the given image batch.

        Args:
            x: The input image batch. Pixel values are expected to be
                in the range of ``[0, 255]``.
        Returns:
            The normalized image batch.
        """
        x = x.float() / 255.0
        return (x - self.mean) / self.std


class Classification(torch.nn.Module):
    """Classification()"""

    def forward(self, x: Tensor, labels: Tensor) -> tuple[Tensor, Tensor]:
        """Given a batch of model outputs and labels, count the correct
        top-1 and top-5 predictions.

        Args:
            x: A batch of model outputs (class scores).
                The shape is ``(batch_size, num_classes)``.
            labels: A batch of labels. The shape is ``(batch_size, 1)``.

        Returns:
            A tuple of the numbers of correct top-1 and top-5 predictions.
        """

        probs = torch.nn.functional.softmax(x, dim=-1)
        top_prob, top_catid = torch.topk(probs, 5)
        # `labels` has shape (batch_size, 1), so it broadcasts against
        # the (batch_size, 5) tensor of top-5 class indices.
        top1 = (top_catid[:, :1] == labels).sum()
        top5 = (top_catid == labels).sum()
        return top1, top5


class ModelBundle(torch.nn.Module):
    """ModelBundle()

    Bundle the transform, model backbone, and classification head into a single module
    for simpler handling."""

    def __init__(self, model, preprocessing, classification, use_bf16):
        super().__init__()
        self.model = model
        self.preprocessing = preprocessing
        self.classification = classification
        self.use_bf16 = use_bf16

    def forward(self, images: Tensor, labels: Tensor) -> tuple[Tensor, Tensor]:
        """Given a batch of images and labels, count the correct top-1 and
        top-5 predictions.

        Args:
            images: A batch of images. The shape is ``(batch_size, 3, 224, 224)``.
            labels: A batch of labels. The shape is ``(batch_size, 1)``.

        Returns:
            A tuple of the numbers of correct top-1 and top-5 predictions.
        """

        x = self.preprocessing(images)

        if self.use_bf16:
            x = x.to(torch.bfloat16)

        output = self.model(x)

        return self.classification(output, labels)


def _expand(vals, batch_size, res):
    return torch.tensor(vals).view(1, 3, 1, 1).expand(batch_size, 3, res, res).clone()


def get_model(
    batch_size: int,
    device_index: int,
    compile: bool,
    use_bf16: bool,
    model_type: str = "mobilenetv3_large_100",
) -> ModelBundle:
    """Build computation model, including transform, model, and classification head.

    Args:
        batch_size: The batch size of the input.
        device_index: The index of the target GPU device.
        compile: Whether to compile the model.
        use_bf16: Whether to use bfloat16 for the model.
        model_type: The type of the model. Passed to ``timm.create_model()``.

    Returns:
        The resulting computation model.
    """
    import timm

    device = torch.device(f"cuda:{device_index}")

    model = timm.create_model(model_type, pretrained=True)
    model = model.eval().to(device=device)

    if use_bf16:
        model = model.to(dtype=torch.bfloat16)

    preprocessing = Preprocessing(
        mean=_expand([0.4850, 0.4560, 0.4060], batch_size, 224),
        std=_expand([0.2290, 0.2240, 0.2250], batch_size, 224),
    ).to(device)

    classification = Classification().to(device)

    if compile:
        with torch.no_grad():
            mode = "max-autotune"
            model = torch.compile(model, mode=mode)
            preprocessing = torch.compile(preprocessing, mode=mode)

    return ModelBundle(model, preprocessing, classification, use_bf16)


def source(
    path: Path,
    prefix: str = "",
    max_samples: int = float("inf"),
) -> Iterator[tuple[str, int]]:
    """Iterate a file containing a list of paths.

    Args:
        path: Path to the file containing list of file paths.
        prefix: Prepended to the paths in the list.
        max_samples: Maximum number of samples to yield.

    Yields:
        The path of the image and its class label.
    """
    class_mapping = get_mappings()

    with open(path) as f:
        i = 0
        for line in f:
            if line := line.strip():
                path_ = prefix + line
                label = class_mapping[parse_wnid(path_)]
                yield path_, label
                if (i := i + 1) >= max_samples:
                    return


def get_decode_func(
    device_index: int,
    width: int = 224,
    height: int = 224,
) -> Callable[[list[tuple[str, int]]], Awaitable[tuple[Tensor, Tensor]]]:
    """Get a function to decode images from a list of paths.

    Args:
        device_index: The index of the target GPU device.
        width: The width of the decoded image.
        height: The height of the decoded image.

    Returns:
        Async function to decode images into a batch tensor in NCHW format
        and labels of shape ``(batch_size, 1)``.
    """
    device = torch.device(f"cuda:{device_index}")

    filter_desc = spdl.io.get_video_filter_desc(
        scale_width=256,
        scale_height=256,
        crop_width=width,
        crop_height=height,
        pix_fmt="rgb24",
    )

    async def decode_images(items: list[tuple[str, int]]):
        paths = [item for item, _ in items]
        labels = [[item] for _, item in items]
        labels = torch.tensor(labels, dtype=torch.int64).to(device)
        buffer = await spdl.io.async_load_image_batch(
            paths,
            width=None,
            height=None,
            pix_fmt=None,
            strict=True,
            filter_desc=filter_desc,
            device_config=spdl.io.cuda_config(
                # Use the requested device rather than a hard-coded index.
                device_index=device_index,
                allocator=(
                    torch.cuda.caching_allocator_alloc,
                    torch.cuda.caching_allocator_delete,
                ),
            ),
        )
        batch = spdl.io.to_torch(buffer)
        # NHWC -> NCHW
        batch = batch.permute((0, 3, 1, 2))
        return batch, labels

    return decode_images


def _get_experimental_nvjpeg_decode_function(
    device_index: int,
    width: int = 224,
    height: int = 224,
):
    device = torch.device(f"cuda:{device_index}")
    device_config = spdl.io.cuda_config(
        device_index=device_index,
        allocator=(
            torch.cuda.caching_allocator_alloc,
            torch.cuda.caching_allocator_delete,
        ),
    )

    async def decode_images_nvjpeg(items: list[tuple[str, int]]):
        paths = [item for item, _ in items]
        labels = [[item] for _, item in items]
        labels = torch.tensor(labels, dtype=torch.int64).to(device)
        buffer = await spdl.io.async_load_image_batch_nvjpeg(
            paths,
            device_config=device_config,
            width=width,
            height=height,
            pix_fmt="rgb",
            # strict=True,
        )
        batch = spdl.io.to_torch(buffer)
        return batch, labels

    return decode_images_nvjpeg


def _get_experimental_nvdec_decode_function(
    device_index: int,
    width: int = 224,
    height: int = 224,
):
    device = torch.device(f"cuda:{device_index}")
    device_config = spdl.io.cuda_config(
        device_index=device_index,
        allocator=(
            torch.cuda.caching_allocator_alloc,
            torch.cuda.caching_allocator_delete,
        ),
    )

    async def decode_images_nvdec(items: list[tuple[str, int]]):
        paths = [item for item, _ in items]
        labels = [[item] for _, item in items]
        labels = torch.tensor(labels, dtype=torch.int64).to(device)
        buffer = await spdl.io.async_load_image_batch_nvdec(
            paths,
            device_config=device_config,
            width=width,
            height=height,
            pix_fmt="rgba",
            strict=True,
        )
        # Drop the alpha channel.
        batch = spdl.io.to_torch(buffer)[:, :-1, :, :]
        return batch, labels

    return decode_images_nvdec


def get_pipeline(
    src: Iterator[tuple[str, int]],
    batch_size: int,
    decode_func: Callable[[list[tuple[str, int]]], Awaitable[tuple[Tensor, Tensor]]],
    concurrency: int,
    buffer_size: int,
    num_threads: int,
) -> Pipeline:
    """Build image data loading pipeline.

    The pipeline uses the ``decode_func`` for decoding images concurrently and
    sends the resulting data to GPU.

    Args:
        src: The source of the data. See :py:func:`source`.
        batch_size: The number of images in a batch.
        decode_func: The async function that decodes a batch of images.
            See :py:func:`get_decode_func`.
        concurrency: The maximum number of ``decode_func`` calls executed
            concurrently.
        buffer_size: The size of the sink buffer that holds decoded batches.
        num_threads: The number of threads in the pipeline's thread pool.

    Returns:
        The resulting pipeline.
    """
    return (
        PipelineBuilder()
        .add_source(src)
        .aggregate(batch_size, drop_last=True)
        .pipe(decode_func, concurrency=concurrency)
        .add_sink(buffer_size)
        .build(num_threads=num_threads)
    )


def benchmark(dataloader: Iterator[tuple[Tensor, Tensor]], model: ModelBundle) -> None:
    """The main loop that measures the performance of dataloading and model inference.

    Args:
        dataloader: The dataloader to benchmark.
        model: The model to benchmark.
    """

    _LG.info("Running inference.")
    num_frames, num_correct_top1, num_correct_top5 = 0, 0, 0
    t0 = time.monotonic()
    try:
        for i, (batch, labels) in enumerate(dataloader):
            # Treat the first 20 iterations as warm-up and restart the measurement.
            if i == 20:
                t0 = time.monotonic()
                num_frames, num_correct_top1, num_correct_top5 = 0, 0, 0

            with (
                torch.profiler.record_function(f"iter_{i}"),
                spdl.utils.trace_event(f"iter_{i}"),
            ):
                top1, top5 = model(batch, labels)

                num_frames += batch.shape[0]
                num_correct_top1 += top1
                num_correct_top5 += top5
    finally:
        elapsed = time.monotonic() - t0
        if num_frames != 0:
            num_correct_top1 = num_correct_top1.item()
            num_correct_top5 = num_correct_top5.item()
            fps = num_frames / elapsed
            _LG.info(f"FPS={fps:.2f} ({num_frames}/{elapsed:.2f})")
            acc1 = 0 if num_frames == 0 else num_correct_top1 / num_frames
            _LG.info(f"Accuracy (top1)={acc1:.2%} ({num_correct_top1}/{num_frames})")
            acc5 = 0 if num_frames == 0 else num_correct_top5 / num_frames
            _LG.info(f"Accuracy (top5)={acc5:.2%} ({num_correct_top5}/{num_frames})")


def _get_pipeline(args, device_index) -> Pipeline:
    src = source(args.input_flist, args.prefix, args.max_samples)

    if args.use_nvjpeg:
        decode_func = _get_experimental_nvjpeg_decode_function(device_index)
        concurrency = 7
    elif args.use_nvdec:
        decode_func = _get_experimental_nvdec_decode_function(device_index)
        concurrency = 4
    else:
        decode_func = get_decode_func(device_index)
        concurrency = args.num_threads

    return get_pipeline(
        src,
        args.batch_size,
        decode_func,
        concurrency,
        args.queue_size,
        args.num_threads,
    )


def entrypoint(args: list[str] | None = None):
    """CLI entrypoint. Run the pipeline, transform, and model, and measure their performance."""

    args = _parse_args(args)
    _init_logging(args.debug)
    _LG.info(args)

    device_index = 0
    model = get_model(args.batch_size, device_index, args.compile, args.use_bf16)
    pipeline = _get_pipeline(args, device_index)

    print(pipeline)

    trace_path = f"{args.trace}"
    if args.use_nvjpeg:
        trace_path = f"{trace_path}.nvjpeg"
    if args.use_nvdec:
        trace_path = f"{trace_path}.nvdec"

    with (
        torch.no_grad(),
        profile() if args.trace else contextlib.nullcontext() as prof,
        spdl.utils.tracing(f"{trace_path}.pftrace", enable=args.trace is not None),
        pipeline.auto_stop(timeout=1),
    ):
        benchmark(pipeline.get_iterator(), model)

    if args.trace:
        prof.export_chrome_trace(f"{trace_path}.json")


def _init_logging(debug=False):
    fmt = "%(asctime)s [%(filename)s:%(lineno)d] [%(levelname)s] %(message)s"
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(format=fmt, level=level)


def get_mappings() -> dict[str, int]:
    """Get the mapping from WordNet ID to class index.

    The 1000 IDs from ILSVRC2012 are used. The class index is the position of
    the ID in the sorted list of WordNet IDs, which matches most publicly
    available models.

    Returns:
        Mapping from WordNet ID to class index.

    Example:

        .. code-block::

           >>> class_mapping = get_mappings()
           >>> print(class_mapping["n03709823"])
           636

    """
    class_mapping = {}

    path = os.path.join(os.path.dirname(__file__), "imagenet_class.tsv")
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            if line := line.strip():
                class_, wnid = line.split("\t")[:2]
                class_mapping[wnid] = int(class_)
    return class_mapping


def parse_wnid(s: str):
    """Parse a WordNet ID (nXXXXXXXX) from string.

    Args:
        s (str): String to parse

    Returns:
        (str): The WordNet ID if found; otherwise, ``ValueError`` is raised.
            If the string contains multiple WordNet IDs, the first one is returned.
    """
    if match := re.search(r"n\d{8}", s):
        return match.group(0)
    raise ValueError(f"The given string does not contain WNID: {s}")


if __name__ == "__main__":
    entrypoint()

Functions

entrypoint(args: list[str] | None = None)

CLI entrypoint. Run the pipeline, transform, and model, and measure their performance.

benchmark(dataloader: Iterator[tuple[Tensor, Tensor]], model: ModelBundle) -> None

The main loop that measures the performance of dataloading and model inference.

Parameters:
  • dataloader – The dataloader to benchmark.

  • model – The model to benchmark.
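
In the source listing, the main loop resets its timer and counters at iteration 20, so the earliest iterations act as warm-up and are left out of the reported FPS. The following is a minimal sketch of that measurement pattern in plain Python. `measure_throughput` and its `warmup` parameter are hypothetical names for illustration; the script itself hard-codes the value 20.

```python
import time
from collections.abc import Iterable


def measure_throughput(batches: Iterable[list], warmup: int = 20) -> tuple[int, float]:
    """Return (number of items, elapsed seconds), excluding the first ``warmup`` batches."""
    num_items = 0
    t0 = time.monotonic()
    for i, batch in enumerate(batches):
        if i == warmup:
            # Restart the clock and the counter so that warm-up work is not measured.
            t0 = time.monotonic()
            num_items = 0
        num_items += len(batch)
    return num_items, time.monotonic() - t0
```

Note that when the input has no more than `warmup` batches, the reset never happens and every batch is counted, which mirrors the behavior of the loop in the listing.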

source(path: Path, prefix: str = '', max_samples: int = inf) -> Iterator[tuple[str, int]]

Iterate a file containing a list of paths.

Parameters:
  • path – Path to the file containing list of file paths.

  • prefix – Prepended to the paths in the list.

  • max_samples – Maximum number of samples to yield.

Yields:

The path of the image and its class label.
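
The iteration logic can be sketched without touching the file system. This is a simplified illustration that reads from any iterable of lines and takes the class mapping as an argument; `iter_samples` is a hypothetical name, and the file names in the usage note are made up.

```python
import re
from collections.abc import Iterable, Iterator


def iter_samples(
    lines: Iterable[str],
    class_mapping: dict[str, int],
    prefix: str = "",
    max_samples: float = float("inf"),
) -> Iterator[tuple[str, int]]:
    """Yield (path, label) pairs, skipping blank lines."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        # The prefix is joined by plain string concatenation,
        # so it needs its own trailing slash.
        path = prefix + line
        match = re.search(r"n\d{8}", path)
        if match is None:
            raise ValueError(f"The given string does not contain WNID: {path}")
        yield path, class_mapping[match.group(0)]
        count += 1
        if count >= max_samples:
            return
```

For example, `list(iter_samples(["val/n03709823/a.JPEG\n"], {"n03709823": 636}, prefix="/data/"))` gives `[("/data/val/n03709823/a.JPEG", 636)]`.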

get_decode_func(device_index: int, width: int = 224, height: int = 224) -> Callable[[list[tuple[str, int]]], Awaitable[tuple[Tensor, Tensor]]]

Get a function to decode images from a list of paths.

Parameters:
  • device_index – The index of the target GPU device.

  • width – The width of the decoded image.

  • height – The height of the decoded image.

Returns:

Async function to decode images into a batch tensor in NCHW format and labels of shape (batch_size, 1).

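get_pipeline(src: Iterator[tuple[str, int]], batch_size: int, decode_func: Callable[[list[tuple[str, int]]], Awaitable[tuple[Tensor, Tensor]]], concurrency: int, buffer_size: int, num_threads: int) -> Pipeline

Build image data loading pipeline.

The pipeline uses the decode_func for decoding images concurrently and sends the resulting data to GPU.

Parameters:
  • src – The source of the data. See source().

  • batch_size – The number of images in a batch.

  • decode_func – The async function that decodes a batch of images. See get_decode_func().

  • concurrency – The maximum number of decode_func calls executed concurrently.

  • buffer_size – The size of the sink buffer that holds decoded batches.

  • num_threads – The number of threads in the pipeline's thread pool.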
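
The pipeline groups source items into batches with `aggregate(batch_size, drop_last=True)` before decoding. The sketch below illustrates the batching behavior this presumably has, in plain Python. It assumes that `drop_last=True` discards a trailing batch with fewer than `batch_size` items; it is an illustration of that assumption, not SPDL's implementation.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def aggregate(items: Iterable, batch_size: int, drop_last: bool = True) -> Iterator[list]:
    """Group consecutive items into lists of ``batch_size`` items."""
    it = iter(items)
    # Take up to `batch_size` items at a time until the source is exhausted.
    while batch := list(islice(it, batch_size)):
        if len(batch) < batch_size and drop_last:
            return  # Discard the incomplete trailing batch.
        yield batch
```

Under this assumption, up to `batch_size - 1` samples at the end of the file list are never decoded or scored, so the sample count in the reported accuracy can be slightly smaller than the length of the file list.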

get_model(batch_size: int, device_index: int, compile: bool, use_bf16: bool, model_type: str = 'mobilenetv3_large_100') -> ModelBundle

Build computation model, including transform, model, and classification head.

Parameters:
  • batch_size – The batch size of the input.

  • device_index – The index of the target GPU device.

  • compile – Whether to compile the model.

  • use_bf16 – Whether to use bfloat16 for the model.

  • model_type – The type of the model. Passed to timm.create_model().

Returns:

The resulting computation model.

get_mappings() -> dict[str, int]

Get the mapping from WordNet ID to class index.

The 1000 IDs from ILSVRC2012 are used. The class index is the position of the ID in the sorted list of WordNet IDs, which matches most publicly available models.

Returns:

Mapping from WordNet ID to class index.

Example

>>> class_mapping = get_mappings()
>>> print(class_mapping["n03709823"])
636
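
The rule that a class index is the position of the ID in the sorted list of WordNet IDs can be sketched as follows. `build_class_index` is a hypothetical helper, not part of the script, which reads the mapping from a TSV file instead. The three IDs are used for illustration only, so the resulting indices differ from those of the full 1000-class mapping.

```python
def build_class_index(wnids: list[str]) -> dict[str, int]:
    """Map each WordNet ID to its position in the sorted list of unique IDs."""
    return {wnid: i for i, wnid in enumerate(sorted(set(wnids)))}


# With only three IDs, "n03709823" gets index 2 rather than 636.
mapping = build_class_index(["n03709823", "n01440764", "n01443537"])
```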

parse_wnid(s: str)

Parse a WordNet ID (nXXXXXXXX) from string.

Parameters:

s (str) – String to parse

Returns:

The WordNet ID if found; otherwise, ValueError is raised.

If the string contains multiple WordNet IDs, the first one is returned.

Return type:

(str)
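
A short sketch of how the pattern `n\d{8}` from the source listing behaves on a few inputs. `first_wnid` is a hypothetical stand-in that mirrors the function above, and the paths are made up for illustration.

```python
import re

WNID_PATTERN = re.compile(r"n\d{8}")


def first_wnid(s: str) -> str:
    """Return the first substring that looks like a WordNet ID."""
    match = WNID_PATTERN.search(s)
    if match is None:
        raise ValueError(f"The given string does not contain WNID: {s}")
    return match.group(0)


in_path = first_wnid("val/n01440764/image.JPEG")
first_of_two = first_wnid("n01440764_n01443537")
# The pattern is not anchored, so a longer digit run still matches on its first 8 digits.
truncated = first_wnid("n123456789")
```

Because the search runs over the whole string, a directory name elsewhere in the path (including the prefix) that happens to match the pattern would be picked up first.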

Classes

class ModelBundle

Bundle the transform, model backbone, and classification head into a single module for simpler handling.

forward(images: Tensor, labels: Tensor) -> tuple[Tensor, Tensor]

Given a batch of images and labels, count the correct top-1 and top-5 predictions.

Parameters:
  • images – A batch of images. The shape is (batch_size, 3, 224, 224).

  • labels – A batch of labels. The shape is (batch_size, 1).

Returns:

A tuple of the numbers of correct top-1 and top-5 predictions.

class Classification

forward(x: Tensor, labels: Tensor) -> tuple[Tensor, Tensor]

Given a batch of model outputs and labels, count the correct top-1 and top-5 predictions.

Parameters:
  • x – A batch of model outputs (class scores). The shape is (batch_size, num_classes).

  • labels – A batch of labels. The shape is (batch_size, 1).

Returns:

A tuple of the numbers of correct top-1 and top-5 predictions.
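
The top-k counting can be illustrated in plain Python. This is a sketch of the idea rather than the tensor code in the listing: since softmax preserves the ordering of scores, it ranks the raw scores directly. `count_topk_correct` is a hypothetical name.

```python
def count_topk_correct(scores: list[list[float]], labels: list[int], k: int) -> int:
    """Count the samples whose label is among the ``k`` highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from the highest score to the lowest.
        ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        correct += label in ranked[:k]
    return correct


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 1]
top1 = count_topk_correct(scores, labels, k=1)
top2 = count_topk_correct(scores, labels, k=2)
```

Here the first sample is correct at top-1, while the second is only correct once the two best classes are considered. Dividing such counts by the number of samples gives the accuracy.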

class Preprocessing(mean: Tensor, std: Tensor)

Perform pixel normalization and data type conversion.

Parameters:
  • mean – The mean value of the dataset.

  • std – The standard deviation of the dataset.

forward(x: Tensor) -> Tensor

Normalize the given image batch.

Parameters:

x – The input image batch. Pixel values are expected to be in the range of [0, 255].

Returns:

The normalized image batch.
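
The normalization first scales pixel values to `[0, 1]` and then applies `(x - mean) / std` per channel. A minimal single-pixel sketch in plain Python, using the mean and standard deviation values that `get_model` passes in the source listing; `normalize_pixel` is a hypothetical helper for illustration.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: tuple[int, int, int]) -> tuple[float, ...]:
    """Normalize one RGB pixel with values in the range [0, 255]."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD))


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
```

The module in the listing does the same arithmetic on whole batches, with the mean and standard deviation pre-expanded to the batch shape.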