Best Practices¶
Avoid creating intermediate tensors¶
For efficient and performant data processing, avoid creating an intermediate Tensor for each individual media object (such as a single image); instead, create a batch Tensor directly.
We recommend decoding individual frames, then using spdl.io.convert_frames()
to create a batch Tensor directly, without creating intermediate Tensors.
If you are decoding a batch of images and you have a pre-determined set of images
that should go into the same batch, use
spdl.io.load_image_batch()
(or its async variant
spdl.io.async_load_image_batch()
).
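For the pre-determined batch case, a minimal sketch might look like the following. (The file paths and the width/height/pix_fmt arguments are illustrative assumptions; check spdl.io.load_image_batch() for the exact parameters.)

import spdl.io

# Hypothetical list of images that belong to the same batch.
srcs = ["sample1.jpg", "sample2.jpg", "sample3.jpg"]

# Decode all images and assemble them into a single batch buffer in one call,
# then convert the whole batch to a Tensor once.
buffer = spdl.io.load_image_batch(srcs, width=224, height=224, pix_fmt="rgb24")
batch = spdl.io.to_torch(buffer)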
Otherwise, demux, decode and pre-process each media object, then combine them with
spdl.io.convert_frames()
(or spdl.io.async_convert_frames()
).
For example, the following functions implement decoding and tensor creation
separately.
import spdl.io
from spdl.io import ImageFrames
from torch import Tensor

def decode_image(src: str) -> ImageFrames:
    # Demux and decode a single image, but do not convert it to a Tensor yet.
    packets = spdl.io.demux_image(src)
    return spdl.io.decode_packets(packets)

def batchify(frames: list[ImageFrames]) -> Tensor:
    # Convert the decoded frames into one batch buffer, then into a single Tensor.
    buffer = spdl.io.convert_frames(frames)
    return spdl.io.to_torch(buffer)
They can be combined in a Pipeline
, which automatically
discards items that fail to process (for example, due to invalid data) and
keeps the batch size consistent by filling batches with other successfully processed items.
from spdl.dataloader import PipelineBuilder

pipeline = (
    PipelineBuilder()
    .add_source(...)
    .pipe(decode_image, concurrency=...)
    .aggregate(32)
    .pipe(batchify)
    .add_sink(3)
    .build(num_threads=...)
)
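Once built, the pipeline can be executed and iterated over. A minimal sketch, assuming the pipeline is started and stopped via the auto_stop() context manager:

# Run the pipeline in background threads and consume the resulting batches.
# (auto_stop() is assumed to start the pipeline and shut it down on exit.)
with pipeline.auto_stop():
    for batch in pipeline:
        ...  # consume the batch, e.g. feed it to the training step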
Make Dataset class composable¶
If you are publishing a dataset and providing an implementation of the Dataset class, we recommend making it composable.
That is, in addition to the conventional Dataset class that
returns Tensors, make the components of the Dataset
implementation available by breaking the implementation down into:

- An Iterator (or map) interface that returns paths instead of Tensors.
- A helper function that loads the source path into a Tensor.
For example, the interface of a Dataset
for image classification
might look like the following.
class Dataset:
    def __getitem__(self, key: int) -> tuple[Tensor, int]:
        ...
We recommend separating the source and the processing, and exposing them as
additional public interfaces.
(Also, as described above, we recommend not converting each item into a
Tensor
, for performance reasons.)
class Source:
    def __getitem__(self, key: int) -> tuple[str, int]:
        ...

def load(data: tuple[str, int]) -> tuple[ImageFrames, int]:
    ...
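For illustration, load could be implemented with the spdl.io functions shown earlier. This is only a sketch; the actual loading logic depends on the dataset.

import spdl.io
from spdl.io import ImageFrames

def load(data: tuple[str, int]) -> tuple[ImageFrames, int]:
    # Decode the image at the given path into ImageFrames, and defer the
    # Tensor conversion to the caller (see the batching advice above).
    path, cls = data
    packets = spdl.io.demux_image(path)
    frames = spdl.io.decode_packets(packets)
    return frames, cls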
If the processing is composed of stages with different bounding factors (for example, a network-bound download and a CPU-bound decode), split it further into primitive functions:
def download(src: tuple[str, int]) -> tuple[bytes, int]:
    ...

def decode_and_preprocess(data: tuple[bytes, int]) -> tuple[ImageFrames, int]:
    ...
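As an illustration only, these stages could be sketched as follows, assuming the source path is an HTTP URL and that spdl.io.demux_image() also accepts in-memory bytes:

import urllib.request

import spdl.io
from spdl.io import ImageFrames

def download(src: tuple[str, int]) -> tuple[bytes, int]:
    # Network-bound stage: fetch the raw image bytes.
    url, cls = src
    with urllib.request.urlopen(url) as resp:
        return resp.read(), cls

def decode_and_preprocess(data: tuple[bytes, int]) -> tuple[ImageFrames, int]:
    # CPU-bound stage: demux and decode the in-memory bytes.
    # (Resizing/cropping is omitted for brevity.)
    raw, cls = data
    packets = spdl.io.demux_image(raw)
    frames = spdl.io.decode_packets(packets)
    return frames, cls

The different bounding factors are why the pipeline below assigns these stages different concurrency values.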
Then the original Dataset can be implemented as a composition of these components:
class Dataset:
    def __init__(self, ...):
        self._src = Source(...)

    def __getitem__(self, key: int) -> tuple[Tensor, int]:
        metadata = self._src[key]
        item = download(metadata)
        frames, cls = decode_and_preprocess(item)
        tensor = spdl.io.to_torch(frames)
        return tensor, cls
Such decomposition makes the dataset compatible with SPDL’s Pipeline, and allows users to run it more efficiently:
pipeline = (
    PipelineBuilder()
    .add_source(Source(...))
    .pipe(download, concurrency=8)
    .pipe(decode_and_preprocess, concurrency=4)
    ...
    .build(...)
)