*Equal contribution
1FAIR at Meta
2CERMICS, ENPC, Institut Polytechnique de Paris, CNRS
3Computer Lab, University of Cambridge
4University College London
5Miles, LAMSADE, Université Paris-Dauphine-PSL
6Korea Institute for Advanced Study
Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development, in which code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, into equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements from scaling our approach: synthesizing dozens of valid invariants and dispatching dozens of subgoals yields hundreds of lines of verified code, overcoming the plateaus reported in previous work. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving the way toward automated construction of large-scale datasets of verified imperative code.
WybeCoder combines SMT-based automatic verification with interactive Lean theorem proving in a hybrid loop. The agent generates imperative code in Velvet—a Dafny-like language embedded in Lean 4 via the Loom framework—annotates it with loop invariants, and iteratively refines both code and proofs using compiler feedback.
Figure 1. WybeCoder subgoal decomposition multi-agent system. Starting from a problem specification, the agent generates an implementation and attempts to discharge verification conditions using CVC5. Remaining goals are tackled interactively in Lean, driving iterative implementation refinement or completing the proof.
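The loop in Figure 1 can be sketched in a few lines. This is a minimal illustration, not the actual WybeCoder API: `generate`, `smt_discharge`, and `lean_prove` are hypothetical stand-ins for the LLM call, the CVC5 pass over verification conditions, and the interactive Lean step.

```python
def hybrid_verify(spec, generate, smt_discharge, lean_prove, max_turns=32):
    # One sequential-agent attempt: draft annotated code, discharge
    # what the SMT solver can, send the rest to Lean, refine on failure.
    # All three callables are illustrative stand-ins.
    code = generate(spec, attempt=None, failures=None)
    for _ in range(max_turns):
        goals = smt_discharge(code)            # VCs the SMT pass left open
        unsolved = [g for g in goals if not lean_prove(g)]
        if not unsolved:
            return code                        # all obligations discharged
        code = generate(spec, attempt=code, failures=unsolved)
    return None                                # turn budget exhausted
```

Each turn feeds the still-open goals back to the model, so the implementation, its invariants, and the proofs are refined together rather than in separate phases.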
Two agent strategies are supported:
- **Sequential Agent**: a single agent iteratively refines the implementation and its annotations over multiple turns using compiler and solver feedback; launching k independent copies corresponds to pass@k.
- **Subgoal Decomposition**: a lead agent decomposes the remaining proof obligations into subgoals and dispatches them to parallel sub-agents, whose sub-proofs are assembled into a complete verification.
The following implementation of the sum of fourth powers of odd numbers was generated and verified end-to-end by GPT-5 on verina_basic_43. It includes a non-trivial closed-form loop invariant, a separate algebraic identity lemma, and a multi-case proof block dispatched by loom_solve.
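The Velvet/Lean artifact itself is not reproduced here. As a plain-language illustration, the following Python sketch checks the kind of closed-form invariant involved at every loop iteration, assuming the task sums the fourth powers of the first n odd numbers (the exact verina_basic_43 specification may differ):

```python
def sum_fourth_powers_of_odds(n):
    # Sum (2k-1)^4 for k = 1..n, asserting at each iteration the
    # closed-form loop invariant 15*s = i(2i-1)(2i+1)(12i^2 - 7),
    # stated multiplied through by 15 to stay in integer arithmetic.
    s = 0
    for i in range(1, n + 1):
        s += (2 * i - 1) ** 4
        assert 15 * s == i * (2 * i - 1) * (2 * i + 1) * (12 * i * i - 7)
    return s
```

The algebraic identity behind the assertion plays the role of the separate identity lemma mentioned above: it is proved once and then invoked at each loop step.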
Performance shows consistent scaling without plateaus across several orders of magnitude of compute budget. We compare sequential agents and subgoal decomposition across frontier models.
(a) Claude 4.5 Opus. Sequential agents vs. subgoal decomposition on Verina.
(b) Gemini 3 Pro. Subgoal decomposition outperforms sequential agents for Gemini.
Model comparison. Inference scaling for the multi-agent system with different models on Verina, using up to 128 sub-agents.
Sequential vs. parallel compute. Given a maximum budget of C language-model calls, we search for the optimal breakdown k·T ≤ C into k ≤ 32 parallel attempts of T ≤ 32 turns each.
Multi-agent scalability. Allocating additional compute to a single subgoal-decomposition run continues to improve performance up to ~1200 model calls, after which two independent runs (pass@2) become preferable. For sequential agents the crossover occurs much earlier, at ~22 calls.
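Where pass@k is reported, it can be computed with the standard unbiased estimator over n sampled attempts with c successes, 1 − C(n−c, k)/C(n, k), sketched here:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator of pass@k from n independent attempts,
    # c of which succeeded: 1 - C(n-c, k) / C(n, k).
    # If fewer than k attempts failed, every size-k subset contains
    # a success, so the probability is exactly 1.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 attempts and 1 success, `pass_at_k(2, 1, 1)` gives 0.5.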
Budget is reported as turns limit × number of agents launched per problem. For the Sequential Agent, k agents in parallel corresponds to pass@k. Solve rate combines the prove rate and the disprove rate; subgoal decomposition does not attempt disproofs.
Results on Verina.

| Method | Model | Turns × Agents | Solve Rate |
|---|---|---|---|
| Baseline | DS Prover V2 7B | 64 × 1 | 20.0% |
| Sequential Agent | GPT-OSS-120B | 16 × 4 | 30.2% |
| | Gemini 3 Pro | 32 × 16 | 55.6% |
| | Claude 4.5 Sonnet | 32 × 16 | 63.3% |
| | GPT-5 | 32 × 16 | 64.6% |
| | Claude 4.5 Opus | 32 × 16 | 74.1% |
| Subgoal Decomp. | Claude 4.5 Sonnet | 8 × 128 | 51.9% |
| | GPT-5 | 8 × 128 | 57.7% |
| | Gemini 3 Pro | 8 × 128 | 57.7% |
| | Claude 4.5 Opus | 8 × 128 | 66.7% |
| | GPT-5 + Goedel Prover | 8 × 512 | 40.7% |
Results on Clever.

| Method | Model | Turns × Agents | Solve Rate |
|---|---|---|---|
| Baseline | COPRA (Claude 3.7) | 600 s (time budget) | 8.7% |
| Sequential Agent | Gemini 3 Pro | 32 × 16 | 32.8% |
| | GPT-5 | 32 × 16 | 53.8% |
| | Claude 4.5 Sonnet | 32 × 16 | 59.6% |
| | Claude 4.5 Opus | 32 × 16 | 62.1% |
| Subgoal Decomp. | Gemini 3 Pro | 8 × 128 | 34.2% |
| | Claude 4.5 Sonnet | 8 × 128 | 41.0% |
| | GPT-5 | 8 × 128 | 46.6% |
| | Claude 4.5 Opus | 8 × 128 | 57.8% |
Sorting algorithms sit at the frontier of what is feasible at moderate compute budgets. The successful verification of Heapsort, using 357 sub-agents, demonstrates the scalability of our approach. Reynolds (1981) gives a by-hand proof in a section starred to connote difficulty, and later formal treatments have sometimes taken up an entire paper.
| Algorithm | Sub-procedure | Verified |
|---|---|---|
| Selection Sort | — | ✓ |
| Bubble Sort | — | ✓ |
| Insertion Sort | — | ✓ |
| Binary Insertion Sort | — | ✓ |
| Recursive Quicksort | — | ✗ |
| Quicksort | Partition | ✓ |
| | Step | ✗ |
| | Sort | ✓ |
| Heapsort | Heapify | ✓ |
| | Maxheap | ✓ |
| | Sort | ✓ |
| Mergesort | Mergeruns | ✓ |
| | Mergepass | ✗ |
| | Sort | ✓ |
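To give a concrete sense of the Heapify/Maxheap/Sort decomposition, here is an unverified Python sketch of heapsort in which runtime assertions stand in for the formal max-heap invariant (this is illustrative only, not the Velvet artifact):

```python
def heapsort(a):
    # In-place heapsort. The two assertions check the max-heap
    # property that the formal development must prove is preserved.
    n = len(a)

    def sift_down(i, size):
        # Heapify: restore the heap property below node i.
        while 2 * i + 1 < size:
            c = 2 * i + 1
            if c + 1 < size and a[c + 1] > a[c]:
                c += 1
            if a[i] >= a[c]:
                break
            a[i], a[c] = a[c], a[i]
            i = c

    def is_max_heap(size):
        return all(a[(j - 1) // 2] >= a[j] for j in range(1, size))

    # Maxheap: build the heap bottom-up.
    for i in range(n // 2 - 1, -1, -1):
        sift_down(i, n)
    assert is_max_heap(n)
    # Sort: repeatedly move the maximum into the sorted suffix.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)
        assert is_max_heap(end)
    return a
```

The nested structure mirrors the sub-procedures in the table above; in the verified setting, each assertion becomes a loop invariant whose preservation generates the verification conditions the agents must discharge.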