Architecture

Note

This section describes internal implementation details of the autoresearch engine. The architecture may change at any time without notice.

The implementation is organized into several modules with clear boundaries. The goal is to keep scheduling generic, domain logic testable, and infrastructure swappable.

flowchart TB
    CLI["python cli.py"]
    Supervisor["Supervisor agent<br>(interactive)"]
    Engine["AsyncWorkEngine<br>(runner.py)"]
    Adapter["AutoresearchAdapter<br>(workflow/)"]
    Policy["policy.py"]
    Store["store.py"]
    Platform["AutoresearchPlatform<br>(platform/)"]
    Agent["_CodingAgent"]

    CLI --> Supervisor
    Supervisor --> Engine
    Engine --> Adapter
    Adapter --> Policy
    Adapter --> Store
    Adapter --> Platform
    Platform --> Agent

Design Principles

Several design choices cut across the modules and are not obvious from the code alone.

The engine must remain domain-neutral. The async work engine must not grow SPDL, coding-agent, source-control, metrics, hypothesis-planning, or experiment-phase logic. If a behavior depends on what an experiment is, it belongs in the workflow adapter, the policy module, or behind a platform capability — never in the runner.

Stop criteria live in the planner, not the engine. The engine stops when the queue and running set are both empty. Autoresearch enforces its own stopping conditions (plateau patience, max iterations, all best practices tried) by returning no children from the planning step. This keeps the engine simple and avoids a domain-specific should_stop callback.
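
As a concrete illustration, the planning gate reduces to something like the sketch below; the names (plan_children, scores, patience) are assumptions for illustration, and the real planner also consults the coding agent and the experiment history.

    # Minimal sketch: stopping is "return no children", not a should_stop hook.
    def plan_children(scores: list[float], max_iters: int, patience: int) -> list[dict]:
        if len(scores) >= max_iters:
            return []  # iteration budget exhausted: the engine drains and stops
        if scores and (len(scores) - 1 - scores.index(max(scores))) >= patience:
            return []  # plateau: the best result is older than `patience` runs
        return [{"experiment": "follow_up", "parent": len(scores) - 1}]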

Resume is phase-based. Each experiment coroutine persists its phase (queued, preparing, running, analyzing, completed, failed) at every meaningful boundary. On resume, the coroutine inspects the persisted phase to skip already-completed steps: a running experiment with a known job ID resumes polling rather than re-launching, and an analyzing experiment skips straight to analysis.
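
In outline, the dispatch is a chain of phase checks. This is a minimal sketch with stub helpers standing in for the real prepare/launch/poll/analyze operations; all names are assumptions.

    from dataclasses import dataclass

    @dataclass
    class Exp:                    # stand-in for the persisted experiment record
        phase: str = "queued"
        job_id: str | None = None

    async def launch(exp: Exp) -> str: return "job-1"     # build + submit (elided)
    async def poll_until_done(job_id: str) -> None: ...   # status polling (elided)
    async def analyze(exp: Exp) -> None: ...              # agent analysis (elided)

    async def run_experiment(exp: Exp) -> None:
        # Each transition is persisted at the boundary (persistence elided).
        if exp.phase in ("queued", "preparing"):
            exp.job_id = await launch(exp)
            exp.phase = "running"
        if exp.phase == "running":
            await poll_until_done(exp.job_id)  # known job ID: poll, never relaunch
            exp.phase = "analyzing"
        if exp.phase == "analyzing":           # a resumed run lands straight here
            await analyze(exp)
            exp.phase = "completed"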

Domain coroutines own their cancellation behavior. The engine cancels asyncio tasks on SIGINT/SIGTERM, but each coroutine decides what state to persist before re-raising CancelledError. Remote jobs are not automatically cancelled by the engine — if remote cancellation is needed, the coroutine or adapter must do it explicitly.
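
The pattern, in outline (a sketch; do_experiment and save_checkpoint are injected stand-ins for the real operations):

    import asyncio

    async def run_cancellable(exp, do_experiment, save_checkpoint):
        try:
            await do_experiment(exp)
        except asyncio.CancelledError:
            save_checkpoint(exp)  # persist a durable phase for the next resume
            # Remote cancellation is NOT automatic; if it is wanted, the
            # coroutine or adapter must request it here explicitly.
            raise                 # always re-raise so the engine observes it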

Failures are structured domain data. Every failure path (prepare, build, launch, poll, analyze, plan) produces a FailureRecord with a FailureKind and FailurePhase. The runner never learns about failure kinds. Expected failures flow through _AutoresearchError; unexpected exceptions are caught and wrapped into structured records. This ensures durable accounting even for phases that never reach a remote job.
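
The record shapes are roughly the following sketch; the exact fields and the FailureKind members shown are assumptions inferred from the names above.

    from dataclasses import dataclass
    from enum import Enum

    class FailurePhase(Enum):       # one member per failure path listed above
        PREPARE = "prepare"
        BUILD = "build"
        LAUNCH = "launch"
        POLL = "poll"
        ANALYZE = "analyze"
        PLAN = "plan"

    class FailureKind(Enum):        # illustrative members, not exhaustive
        EXPECTED = "expected"       # surfaced as _AutoresearchError
        UNEXPECTED = "unexpected"   # caught and wrapped by the workflow

    @dataclass(frozen=True)
    class FailureRecord:
        phase: FailurePhase
        kind: FailureKind
        message: str                # recorded even if no remote job was launched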

Async Work Engine

The generic runner (utils/runner.py) knows nothing about SPDL, training jobs, source control, metrics, or hypothesis planning. It operates on serializable _WorkSpec objects and a _WorkAdapter protocol:

  • Maintains a priority queue of pending _WorkSpec objects.

  • Starts up to max_concurrency coroutines via the adapter.

  • Waits for the first coroutine to complete.

  • Passes completed _WorkResult objects (which may contain child specs) back to the adapter and re-queues children.

  • Checkpoints queued and running specs on cancellation.

The runner does not inspect experiment payloads. Infrastructure-specific work belongs in the platform capability layer, and domain decisions belong in the workflow adapter.
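
That boundary can be sketched as Python protocols; the attribute and method names below are illustrative, and the real definitions live in utils/runner.py.

    from typing import Any, Protocol

    class _WorkSpec(Protocol):
        priority: int               # ordering key for the pending queue
        payload: dict[str, Any]     # opaque to the engine; never inspected

    class _WorkResult(Protocol):
        spec: _WorkSpec
        children: list[_WorkSpec]   # follow-up specs the engine re-queues

    class _WorkAdapter(Protocol):
        async def run(self, spec: _WorkSpec) -> _WorkResult:
            """One coroutine per spec; the engine awaits up to max_concurrency."""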

Workflow

The autoresearch workflow (utils/workflow/) is the domain side of the boundary. It turns an experiment _WorkSpec into a coroutine that performs the full experiment lifecycle:

  • Restore or prepare the source tree.

  • Apply code changes when the experiment requires a rebuild.

  • Build the image and launch the remote job.

  • Poll for completion and detect stalled jobs.

  • Collect metrics and run coding agent analysis.

  • Record state, master-table rows, findings, and plots.

  • Ask the coding agent for follow-up experiments and return them as child _WorkSpec objects.
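
Put together, the lifecycle reads as one coroutine. Every capability method below (restore, apply_changes, build, launch, poll, collect, analyze, plan) is an assumed name used for illustration, not the actual interface.

    # Sketch of the experiment lifecycle inside the adapter (names illustrative).
    async def run(self, spec):
        tree = await self.platform.workspace.restore(spec)       # source tree
        image = spec.base_image                                  # assumed field
        if spec.needs_rebuild:
            await self.platform.agent.apply_changes(tree, spec)  # code changes
            image = await self.platform.artifacts.build(tree)    # fresh image
        job = await self.platform.execution.launch(image, spec)
        await self.platform.execution.poll(job)                  # + stall checks
        metrics = await self.platform.evidence.collect(job)
        findings = await self.platform.agent.analyze(metrics)
        self.store.record(spec, metrics, findings)               # durable state
        children = await self.platform.agent.plan(self.store.history())
        return _WorkResult(spec=spec, children=children)         # re-queued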

The workflow is split into focused modules:

  • adapter.py – the AutoresearchAdapter that implements _WorkAdapter and orchestrates the experiment coroutine.

  • policy.py – deterministic decisions (planning gates, duplicate filtering, stall detection) expressed as pure functions that can be unit tested without infrastructure (see the sketch after this list).

  • store.py – durable state persistence (master table, findings, tree visualization).

  • analysis_ops.py / planning_ops.py / source_ops.py – individual workflow operations that interact with the platform.
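
Because policy functions are pure, tests need no infrastructure at all. The two functions below are illustrative stand-ins, not the actual heuristics:

    def is_stalled(last_progress_s: float, now_s: float, timeout_s: float) -> bool:
        """A job counts as stalled after `timeout_s` seconds without progress."""
        return (now_s - last_progress_s) >= timeout_s

    def filter_duplicates(proposed: list[str], tried: set[str]) -> list[str]:
        """Drop proposed follow-ups that were already attempted."""
        return [name for name in proposed if name not in tried]

    assert filter_duplicates(["nvdec", "batch_16"], {"nvdec"}) == ["batch_16"]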

Platform Capabilities

The platform layer (utils/platform/) provides a capability boundary between the workflow and infrastructure. AutoresearchPlatform bundles five capability objects:

  • _Workspace – source control operations (detect SCM, commit, restore, check for changes).

  • _Artifacts – image building and tagging.

  • _Execution – job launch, status polling, and cancellation.

  • _Evidence – metrics collection and system profiling.

  • _CodingAgent – stateless coding agent invocations (analysis, planning, code changes).

The workflow can swap local, remote, Claude, Codex, or test implementations by replacing these capability objects without changing any orchestration code.
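
In outline, the bundle is a record of five protocols. Only the capability names below come from the list above; the signatures are elided and the structure is a sketch.

    from dataclasses import dataclass
    from typing import Protocol

    class _Workspace(Protocol): ...     # detect SCM, commit, restore, diff
    class _Artifacts(Protocol): ...     # build and tag images
    class _Execution(Protocol): ...     # launch, poll, cancel jobs
    class _Evidence(Protocol): ...      # collect metrics, profile systems
    class _CodingAgent(Protocol): ...   # stateless analysis/planning/code calls

    @dataclass(frozen=True)
    class AutoresearchPlatform:
        workspace: _Workspace
        artifacts: _Artifacts
        execution: _Execution
        evidence: _Evidence
        agent: _CodingAgent

A test suite can build the bundle from fakes, and a Claude-to-Codex switch replaces only the agent field; the orchestration code never changes.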

flowchart LR
    Workflow["AutoresearchAdapter"]
    Platform["AutoresearchPlatform"]
    Workspace["_Workspace"]
    Artifacts["_Artifacts"]
    Execution["_Execution"]
    Evidence["_Evidence"]
    Agent["_CodingAgent"]

    Workflow --> Platform
    Platform --> Workspace
    Platform --> Artifacts
    Platform --> Execution
    Platform --> Evidence
    Platform --> Agent

Stateless Agent Invocations

Each coding agent call is fully stateless. The workflow constructs a self-contained prompt that includes everything the agent needs: the SPDL optimization knowledge base, the full experiment history, collected metrics, and the pipeline source code. There is no persistent conversation or session state.

This design makes the system robust to interruptions. After Ctrl+C, the engine can resume from the last persisted checkpoint without relying on a conversation session. It also means the coding agent can be swapped between runs (e.g., switching from Claude to Codex) with no state migration.
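
Prompt assembly is therefore a pure function of persisted data. A minimal sketch, assuming the four inputs named above (the section headers are illustrative):

    def build_analysis_prompt(knowledge_base: str, history: str,
                              metrics: str, pipeline_source: str) -> str:
        """Every call inlines everything the agent needs; no session survives."""
        return "\n\n".join([
            "## SPDL optimization knowledge base", knowledge_base,
            "## Experiment history", history,
            "## Collected metrics", metrics,
            "## Pipeline source code", pipeline_source,
        ])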

Hypothesis Tree

Experiments are organized in a tree structure. The seed experiments (baseline, headspace, MTP) are root nodes. Follow-up experiments proposed by the coding agent become children of the node that triggered the planning.

baseline
headspace
mtp
├── gpu_nvdec_decode
│   ├── split_demux_decode
│   │   └── nvdec_c7_optimal
│   └── nvdec_c20_oversub
├── batch_size_16
└── torch_compile

Each node tracks its status (queued, preparing, running, analyzing, completed, failed), the source control commit it was built from, and the analysis results. The tree is owned by the workflow store and visualized as hypothesis_tree.png after each experiment completes.
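
A node therefore carries roughly the following fields (a sketch; the names are assumptions):

    from dataclasses import dataclass, field

    @dataclass
    class HypothesisNode:
        name: str
        status: str                   # queued/preparing/running/analyzing/completed/failed
        commit: str                   # SCM commit the experiment was built from
        analysis: str | None = None   # filled in after the analyze phase
        children: list["HypothesisNode"] = field(default_factory=list)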

The following is the hypothesis tree from the video classification example, showing 116 nodes explored across 120 experiments:

[Figure: hypothesis tree from the video classification optimization run]