.. _autoresearch-example:

Example: Video Classification
=============================

This page walks through a complete autoresearch run on the video classification example pipeline in ``spdl/examples/video_classification``. The engine ran for approximately 23 hours, executed 120 experiments, and reduced step time from 426ms to 153ms — a 2.78× speedup.

Setup
-----

**Pipeline**: Video classification training with R3D-18 on Kinetics-400. The baseline data pipeline uses the SPDL ``PipelineBuilder`` with the following stages:

.. code-block:: text

   Sampling → fetch → decode (FFmpeg, CPU) → aggregate → collate → GPU transfer

**Hardware**: 1 node × 8 H100 GPUs.

**Engine configuration**:

.. code-block:: bash

   python run.py /tmp/video_classification_opt \
       --pipeline-script spdl/examples/video_classification/video_classification.py \
       --source-dir spdl/examples/video_classification \
       --build-command "docker build -t my_image ." \
       --base-launch-command "torchx run ... --img \$IMAGE -j 1x8 ..." \
       --max-iterations 20 --patience 5 --max-concurrency 4 --job-timeout 600

The engine ran with up to 4 concurrent experiments and a 10-minute timeout per job. The run was based on commit ``3bf24e7``.

Baseline
--------

The unmodified pipeline produced **426ms/step** with 5.75% GPU SM utilization. The headspace analysis (CacheDataLoader) measured a compute floor of **29.8ms/step** — 93% of step time was spent waiting for data. Video decoding was the dominant bottleneck by a wide margin.

The MTP (subprocess pipeline) seed experiment failed: the dataset object is not picklable, so the pipeline could not be moved to a subprocess. This meant all optimization had to happen within the main process.

Results
-------

The best configuration (run ``101_optimal_consolidated``) achieved **153ms/step**, a **64.1% reduction** from baseline. The result was independently confirmed by a reproduction run. The optimized pipeline is available in the repository at commit ``b1d98c5``.
.. list-table:: Optimization Breakdown
   :header-rows: 1
   :widths: 40 15 20

   * - Optimization
     - Step Time
     - Cumulative Improvement
   * - **Baseline** (CPU FFmpeg, concurrency=16)
     - 426ms
     - —
   * - \+ GPU NVDEC decode (concurrency=14)
     - 308ms
     - ↓27.7%
   * - \+ Split demux/decode pipeline
     - 280ms
     - ↓34.3%
   * - \+ Optimal NVDEC concurrency=7 (1:1 HW slots)
     - 280ms
     - ↓34.3%
   * - \+ Subclip 2s temporal windowing
     - 195ms
     - ↓54.2%
   * - \+ bf16 autocast + fused optimizer
     - 184ms
     - ↓56.8%
   * - \+ Dedicated thread executors
     - 171ms
     - ↓59.9%
   * - \+ Fetch concurrency=4 + build threads tuning
     - 170ms
     - ↓60.1%
   * - **All consolidated**
     - **153ms**
     - **↓64.1%**

Key Discoveries
---------------

What worked
~~~~~~~~~~~

1. **GPU NVDEC decode** — The single biggest win. Replacing CPU FFmpeg with the H100's dedicated NVDEC hardware video decoders reduced per-item decode time from 0.85s to 0.54s at concurrency=14, and to 0.245s at concurrency=7 with zero hardware contention.

2. **Optimal NVDEC concurrency = 7** — H100 has 7 NVDEC instances. A 1:1 mapping (concurrency=7) gives zero hardware contention. Oversubscription buys nothing: concurrency=14 roughly doubles per-item latency (0.54s) while doubling parallelism, netting the same 280ms step time. At concurrency=20+, the regression is severe.

3. **Split demux/decode pipeline** — Separating CPU demuxing from GPU NVDEC decode into distinct pipeline stages enables CPU-GPU overlap, yielding ~9% improvement over monolithic NVDEC.

4. **Subclip 2s temporal windowing** — Limiting video segment length to 2s reduces NVDEC decode work by roughly 2× (0.245s → 0.120s per item). 2.0s is the optimal duration — shorter durations add demux seeking overhead, longer durations increase decode time.

5. **Dedicated thread executors** — Outperformed ``PriorityThreadPoolExecutor`` by 7% (171ms vs 184ms) by eliminating pool contention between stages. Configuration: 8 threads for demux, 8 for fetch, 7 for NVDEC decode.
6. **bf16 autocast + fused optimizer** — Added ~2.5% compute savings. Small individually, but real when stacked with other optimizations.

What did not work
~~~~~~~~~~~~~~~~~

1. **Larger batch sizes** — Always hurt in decode-bottlenecked pipelines. NVDEC hardware produces a fixed ~28 items/s/rank; larger batches mean fewer batches per second (batch=16: ↓11.5%, batch=32: ↓20.7%).

2. **MTP (subprocess isolation)** — The dataset object is fundamentally not picklable. The Tier 2 workaround (callable classes) achieved only a 4.2% gain.

3. **NVDEC oversubscription (concurrency > 14)** — Per-item decode time scales poorly: concurrency=20 adds +100% latency, concurrency=28 adds +200%, and concurrency=35 crashes on init.

4. **torch.compile** — Zero steady-state improvement (compute is only 15% of step time) but adds ~76s of compilation warmup. ``mode='reduce-overhead'`` crashed during CUDA graph capture.

5. **DDP optimization flags** (``static_graph``, ``gradient_as_bucket_view``) — Zero measurable effect. DDP overhead is <1% of step time.

6. **GC management** — Crashed on startup 4 consecutive times. When it finally ran, it provided 0% throughput improvement.

Progress
--------

The following plot shows experiment duration, step time, SM utilization, and raw SM utilization across all 120 experiments. Green dots mark improvements over the running best. The dashed blue line shows the headspace floor (29.8ms compute-only step time).

.. image:: /_static/data/autoresearch_video_classification_progress.png
   :alt: Autoresearch progress over 120 experiments
   :width: 100%

Step time dropped sharply in the first 30 experiments as the engine discovered GPU NVDEC decode and the split demux/decode pipeline. Subsequent experiments fine-tuned concurrency, temporal windowing, and executor configuration to push step time from ~280ms down to 153ms.

Hypothesis Tree
---------------

The full experiment tree shows how the engine explored the optimization space.
Each node is an experiment; edges connect parent experiments to follow-ups proposed by the coding agent. Green nodes are kept improvements, gray nodes are completed without improvement, and red nodes are failures.

.. image:: /_static/data/autoresearch_video_classification_hypothesis_tree.png
   :alt: Hypothesis tree for 120 experiments
   :width: 100%

Starting from three seed experiments (baseline, headspace, MTP), the tree branched into GPU NVDEC decode early — the engine identified CPU video decoding as the dominant bottleneck within the first few experiments. From there, it explored concurrency tuning, pipeline splitting, temporal windowing, and compute optimizations. Failed experiments (red nodes) include NVDEC oversubscription, ``torch.compile`` crashes, and GC management attempts. The best path runs through NVDEC decode → split demux/decode → subclip 2s → bf16 autocast → dedicated executors → consolidated optimal.

Remaining Headspace
-------------------

.. list-table::
   :widths: 40 20

   * - Compute floor (CacheDataLoader)
     - 29.8ms
   * - Best achieved
     - 153ms
   * - Remaining headspace
     - ~80%
   * - Bottleneck
     - NVDEC hardware decode rate (~28 items/s/rank)

The 153ms result is at the NVDEC hardware throughput ceiling. Further improvement would require a fundamentally different approach to video decoding: a pre-decoded dataset, a codec with faster decode characteristics, multi-node decode distribution, or reduced spatial resolution.

Experiment Statistics
---------------------

.. list-table::
   :widths: 40 15

   * - Total experiments
     - 120
   * - Completed
     - 89
   * - Failed (job stalled)
     - 5
   * - Failed (runtime error)
     - 17
   * - Failed (planning)
     - 4
   * - Failed (analysis)
     - 1
   * - Kept improvements
     - 12
   * - Total nodes explored
     - 116

The engine ran 120 experiments over approximately 23 hours with up to 4 concurrent jobs. Of the 89 that completed successfully, 12 produced kept improvements — a 13% hit rate on completed experiments.
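The headline figures quoted on this page are internally consistent; a minimal arithmetic cross-check, using only numbers reported above:

.. code-block:: python

   # Cross-check the headline figures quoted on this page.
   baseline_ms = 426   # unmodified pipeline step time
   best_ms = 153       # best consolidated configuration
   completed = 89      # experiments that completed successfully
   kept = 12           # kept improvements

   speedup = baseline_ms / best_ms                    # speedup factor
   reduction_pct = (1 - best_ms / baseline_ms) * 100  # step-time reduction
   hit_rate_pct = kept / completed * 100              # hit rate on completed runs

   print(f"{speedup:.2f}x, {reduction_pct:.1f}%, {hit_rate_pct:.0f}%")
   # 2.78x, 64.1%, 13%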
The 31 failures are a natural part of the search: the engine tries aggressive configurations (high NVDEC oversubscription, ``torch.compile`` with CUDA graphs, GC tuning) knowing that some will crash. Each failure narrows the search space and informs subsequent planning.
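As an aside, the per-stage executor layout described under "What worked" (item 5) can be sketched with the standard library alone. This is a generic illustration of giving each pipeline stage its own thread pool, not the SPDL API; the ``demux`` and ``decode`` stage functions here are hypothetical stand-ins:

.. code-block:: python

   # Generic sketch: one dedicated pool per stage, so a slow stage cannot
   # starve the other stages of worker threads (no shared-pool contention).
   from concurrent.futures import ThreadPoolExecutor

   def demux(item):
       # Stand-in for CPU demux work.
       return item * 2

   def decode(item):
       # Stand-in for decode work.
       return item + 1

   demux_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="demux")
   decode_pool = ThreadPoolExecutor(max_workers=7, thread_name_prefix="decode")

   # Stage 1 runs on its own pool; its results feed stage 2's pool.
   demuxed = list(demux_pool.map(demux, range(4)))
   decoded = list(decode_pool.map(decode, demuxed))

   demux_pool.shutdown()
   decode_pool.shutdown()

   print(decoded)  # [1, 3, 5, 7]

The worker counts (8 demux, 7 decode) mirror the winning configuration from the run; in the real pipeline the stages run concurrently rather than in two sequential passes.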