Example: Video Classification

This page walks through a complete autoresearch run on the video classification example pipeline in spdl/examples/video_classification. The engine ran for approximately 23 hours, executed 120 experiments, and reduced step time from 426ms to 153ms — a 2.78× speedup.

Setup

Pipeline: Video classification training with R3D-18 on Kinetics-400. The baseline data pipeline uses SPDL PipelineBuilder with the following stages:

Sampling → fetch → decode (FFmpeg, CPU) → aggregate → collate → GPU transfer
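
In SPDL terms, those stages map onto a `PipelineBuilder` roughly as follows. This is a minimal sketch, not the actual `video_classification.py` source: the stage functions, batch size, and thread counts are illustrative placeholders (only the decode concurrency of 16 comes from the baseline configuration reported below).

```python
from spdl.pipeline import PipelineBuilder

# Placeholder stages; the real implementations live in video_classification.py.
def fetch(path): ...         # download/read raw video bytes
def decode_video(data): ...  # CPU FFmpeg decode (the baseline bottleneck)
def collate(items): ...      # stack clips into a batch tensor
def transfer(batch): ...     # copy the batch to GPU

video_paths = ["clip_0001.mp4", "clip_0002.mp4"]  # stands in for the sampler

pipeline = (
    PipelineBuilder()
    .add_source(video_paths)
    .pipe(fetch, concurrency=4)
    .pipe(decode_video, concurrency=16)  # concurrency=16, as in the baseline row below
    .aggregate(8)                        # batch size is illustrative
    .pipe(collate)
    .pipe(transfer)
    .add_sink(3)                         # output buffer size
    .build(num_threads=32)
)
```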

Hardware: 1 node × 8 H100 GPUs (torchx -j 1x8).

Engine configuration:

```
python run.py /tmp/video_classification_opt \
  --pipeline-script spdl/examples/video_classification/video_classification.py \
  --source-dir spdl/examples/video_classification \
  --build-command "docker build -t my_image ." \
  --base-launch-command "torchx run ... --img \$IMAGE -j 1x8 ..." \
  --max-iterations 20 --patience 5 --max-concurrency 4 --job-timeout 600
```

The engine ran with up to 4 concurrent experiments and a 10-minute timeout per job. The run was based on commit 3bf24e7.

Baseline

The unmodified pipeline produced 426ms/step with 5.75% GPU SM utilization. The headspace analysis (CacheDataLoader) measured a compute floor of 29.8ms/step — 93% of step time was spent waiting for data. Video decoding was the dominant bottleneck by a wide margin.
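
CacheDataLoader itself isn't shown in this report; the assumed idea, sketched below, is to pay the data-pipeline cost for one batch and then replay that batch from memory, so every measured step is compute-only and the result is a floor on step time.

```python
# Sketch of the assumed CacheDataLoader behavior: fetch one real batch,
# then replay it so each subsequent step measures pure compute.
class CacheDataLoader:
    def __init__(self, loader, num_steps):
        self.loader = loader
        self.num_steps = num_steps

    def __iter__(self):
        batch = next(iter(self.loader))   # pipeline cost paid exactly once
        for _ in range(self.num_steps):
            yield batch                   # compute-only steps from here on
```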

The MTP (subprocess pipeline) seed experiment failed: the dataset object is not picklable, so the pipeline could not be moved to a subprocess. This meant all optimization had to happen within the main process.
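
Why pickling matters here: moving the pipeline to a subprocess requires serializing everything it references. The report doesn't name the unpicklable member, so the snippet below reproduces the failure mode with a thread lock as a hypothetical stand-in.

```python
import pickle
import threading

class VideoDataset:
    """Stand-in for the example's dataset; the lock is a hypothetical
    unpicklable member (the real culprit isn't named in the report)."""
    def __init__(self):
        self._lock = threading.Lock()

try:
    pickle.dumps(VideoDataset())
except TypeError as e:
    print(e)  # e.g. "cannot pickle '_thread.lock' object"
```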

Results

The best configuration (run 101_optimal_consolidated) achieved 153ms/step, a 64.1% reduction from baseline. The result was independently confirmed by a reproduction run. The optimized pipeline is available in the repository at commit b1d98c5.

Optimization Breakdown

| Optimization | Step Time | Cumulative Improvement |
| --- | --- | --- |
| Baseline (CPU FFmpeg, concurrency=16) | 426ms | (baseline) |
| + GPU NVDEC decode (concurrency=14) | 308ms | ↓27.7% |
| + Split demux/decode pipeline | 280ms | ↓34.3% |
| + Optimal NVDEC concurrency=7 (1:1 HW slots) | 280ms | ↓34.3% |
| + Subclip 2s temporal windowing | 195ms | ↓54.2% |
| + bf16 autocast + fused optimizer | 184ms | ↓56.8% |
| + Dedicated thread executors | 171ms | ↓59.9% |
| + Fetch concurrency=4 + build threads tuning | 170ms | ↓60.1% |
| All consolidated | 153ms | ↓64.1% |

Key Discoveries

What worked

  1. GPU NVDEC decode — The single biggest win. Replacing CPU FFmpeg with the H100’s dedicated NVDEC hardware video decoders reduced per-item decode time from 0.85s to 0.54s at concurrency=14, and to 0.245s at concurrency=7 with zero hardware contention.

  2. Optimal NVDEC concurrency = 7 — H100 has 7 NVDEC instances. A 1:1 mapping (concurrency=7) gives zero hardware contention. Oversubscription degrades throughput: concurrency=14 doubles per-item latency (0.54s) but doubles parallelism, netting the same 280ms step time. At concurrency=20+, the regression is severe.

  3. Split demux/decode pipeline — Separating CPU demuxing from GPU NVDEC decode into distinct pipeline stages enables CPU-GPU overlap, yielding ~9% improvement over monolithic NVDEC.

  4. Subclip 2s temporal windowing — Limiting video segment length to 2s reduces NVDEC decode work by roughly 2× (0.245s → 0.120s per item). 2.0s is the optimal duration — shorter durations add demux seeking overhead, longer durations increase decode time.

  5. Dedicated thread executors — Outperformed PriorityThreadPoolExecutor by 7% (171ms vs 184ms) by eliminating pool contention between stages. Configuration: 8 threads for demux, 8 for fetch, 7 for NVDEC decode (see the sketch after this list).

  6. bf16 autocast + fused optimizer — Added ~2.5% compute savings. Small individually, but real when stacked with other optimizations.
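
Putting items 1–5 together, the optimized data path might look roughly like this in `PipelineBuilder` terms. Treat it as a sketch: the stage functions are placeholders (real code is in the repository at commit b1d98c5), and passing a custom `executor` to `pipe()` is an assumption about the SPDL API rather than the run's actual configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from spdl.pipeline import PipelineBuilder

# Dedicated per-stage executors (thread counts from the winning config)
# remove cross-stage pool contention.
fetch_exec = ThreadPoolExecutor(max_workers=8)
demux_exec = ThreadPoolExecutor(max_workers=8)
decode_exec = ThreadPoolExecutor(max_workers=7)

# Placeholder stages, as in the baseline sketch above.
def fetch(path): ...            # raw bytes
def demux_2s(data): ...         # CPU demux of a 2 s subclip window
def decode_nvdec(packets): ...  # GPU NVDEC decode of the demuxed packets
def collate(items): ...
def transfer(batch): ...

video_paths = ["clip_0001.mp4", "clip_0002.mp4"]

pipeline = (
    PipelineBuilder()
    .add_source(video_paths)
    .pipe(fetch, concurrency=4, executor=fetch_exec)
    .pipe(demux_2s, concurrency=8, executor=demux_exec)       # CPU demux stage...
    .pipe(decode_nvdec, concurrency=7, executor=decode_exec)  # ...overlapped with GPU decode, 1:1 with the 7 NVDEC units
    .aggregate(8)
    .pipe(collate)
    .pipe(transfer)
    .add_sink(3)
    .build(num_threads=32)
)
```

Item 6 uses stock PyTorch APIs. A self-contained example with a dummy batch; the optimizer choice, learning rate, and shapes are illustrative, not taken from the run:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

model = r3d_18(num_classes=400).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)  # fused CUDA kernel

clips = torch.randn(8, 3, 16, 112, 112, device="cuda")   # (N, C, T, H, W) dummy batch
labels = torch.randint(0, 400, (8,), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # bf16 autocast
    loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```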

What did not work

  1. Larger batch sizes — Always hurt in decode-bottlenecked pipelines. NVDEC hardware produces a fixed ~28 items/s/rank; larger batches mean fewer batches per second (batch=16: ↓11.5%, batch=32: ↓20.7%).

  2. MTP (subprocess isolation) — The dataset object is fundamentally not picklable. The Tier 2 workaround (callable classes) achieved only a 4.2% gain.

  3. NVDEC oversubscription (concurrency > 14) — Per-item decode time scales poorly: concurrency=20 adds +100% latency, concurrency=28 adds +200%, concurrency=35 crashes on init.

  4. torch.compile — Zero steady-state improvement (compute is only 15% of step time) but adds ~76s compilation warmup. mode='reduce-overhead' crashed during CUDA graph capture (see the sketch after this list).

  5. DDP optimization flags (static_graph, gradient_as_bucket_view) — Zero measurable effect. DDP overhead is <1% of step time.

  6. GC management — Crashed on startup 4 consecutive times. When it finally ran, it provided 0% throughput improvement.
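
For reference, items 4 and 5 correspond to standard PyTorch knobs. A minimal sketch of what the engine was toggling; it assumes a process group already initialized by the launcher, and the wrapped module is a stand-in:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

module = torch.nn.Linear(8, 8).cuda()  # stand-in for the wrapped R3D-18

# Item 5: DDP flags that showed zero measurable effect
# (DDP overhead is <1% of step time).
model = DDP(module, static_graph=True, gradient_as_bucket_view=True)

# Item 4: torch.compile gave no steady-state gain and ~76s of warmup;
# mode="reduce-overhead" (CUDA graph capture) crashed outright.
model = torch.compile(model, mode="reduce-overhead")
```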

Progress

The following plot shows experiment duration, step time, SM utilization, and raw SM utilization across all 120 experiments. Green dots mark improvements over the running best. The dashed blue line shows the headspace floor (29.8ms compute-only step time).

[Figure: Autoresearch progress over 120 experiments]

Step time dropped sharply in the first 30 experiments as the engine discovered GPU NVDEC decode and the split demux/decode pipeline. Subsequent experiments fine-tuned concurrency, temporal windowing, and executor configuration to push step time from ~280ms down to 153ms.

Hypothesis Tree

The full experiment tree shows how the engine explored the optimization space. Each node is an experiment; edges connect parent experiments to follow-ups proposed by the coding agent. Green nodes are kept improvements, gray nodes are completed without improvement, and red nodes are failures.

[Figure: Hypothesis tree for 120 experiments]

Starting from three seed experiments (baseline, headspace, MTP), the tree branched into GPU NVDEC decode early — the engine identified CPU video decoding as the dominant bottleneck within the first few experiments. From there, it explored concurrency tuning, pipeline splitting, temporal windowing, and compute optimizations. Failed experiments (red nodes) include NVDEC oversubscription, torch.compile crashes, and GC management attempts. The best path runs through NVDEC decode → split demux/decode → subclip 2s → bf16 autocast → dedicated executors → consolidated optimal.

Remaining Headspace

| Metric | Value |
| --- | --- |
| Compute floor (CacheDataLoader) | 29.8ms |
| Best achieved | 153ms |
| Remaining headspace | ~80% |
| Bottleneck | NVDEC hardware decode rate (~28 items/s/rank) |
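
The ~80% figure is simply (best − floor) / best: (153 − 29.8) / 153 ≈ 0.805, i.e. about four-fifths of the best achieved step time is still data wait rather than compute.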

The 153ms result is at the NVDEC hardware throughput ceiling. Further improvement would require a fundamentally different approach to video decoding: a pre-decoded dataset, a codec with faster decode characteristics, multi-node decode distribution, or reduced spatial resolution.

Experiment Statistics

| Metric | Count |
| --- | --- |
| Total experiments | 120 |
| Completed | 89 |
| Failed (job stalled) | 5 |
| Failed (runtime error) | 17 |
| Failed (planning) | 4 |
| Failed (analysis) | 1 |
| Kept improvements | 12 |
| Total nodes explored | 116 |

The engine ran 120 experiments over approximately 23 hours with up to 4 concurrent jobs. Of the 89 that completed successfully, 12 produced kept improvements, a 13% hit rate on completed experiments. The 31 runs that did not complete are a natural part of the search: the engine tries aggressive configurations (high NVDEC oversubscription, torch.compile with CUDA graphs, GC tuning) knowing that some will crash. Each failure narrows the search space and informs subsequent planning.