SPDL (Scalable and Performant Data Loading)
Publications
Featured in PyCoder’s Weekly #706.
Resolving Data Starvation with Observability in AI Training (PyTorch Conference, 2025-10-22)
Optimizing Data Loading for Efficient AI Model Training (@Scale: AI & DATA, 2025-06-25)
Scalable and Performant Data Loading (arXiv, 2025-04-23)
Introducing SPDL: Faster AI model training with thread-based data loading (Meta Engineering Blog, 2024-11-22)
Citation
If you find our project useful, please cite it using the following BibTeX entry.
@misc{hira2025scalableperformantdataloading,
      title={Scalable and Performant Data Loading},
      author={Moto Hira and Christian Puhrsch and Valentin Andrei and Roman Malinovskyy and Gael Le Lan and Abhinandan Krishnan and Joseph Cummings and Miguel Martin and Gokul Gunasekaran and Yuta Inoue and Alex J Turner and Raghuraman Krishnamoorthi},
      year={2025},
      eprint={2504.20067},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2504.20067},
}
Contents
- Overview
- Installation
- Getting Started
- Introduction to Async I/O
- IO Module
- Optimization Guide
- Case Studies
- Migration Guide
- Best Practices
- Examples
  - Image dataloading
  - Video dataloading
  - Imagenet classification
  - Multi thread preprocessing
  - Streaming nvdec decoding
  - Streaming video processing
  - Performance analysis
  - Performance simulation
  - Hydra integration
  - Pipeline definitions
  - Pipeline profiling
  - Benchmark utils
  - Benchmark numpy
  - Benchmark wav
  - Benchmark tarfile
  - Benchmark video
- Frequently Asked Questions
API References