WybeCoder: Verified Imperative Code Generation

Fabian Gloeckle*1,2, Mantas Bakšys*1,3, Darius Feher1,4, Kunhao Zheng1,5, Amaury Hayat2,6, Sean B. Holden3, Gabriel Synnaeve1, Peter O'Hearn1,4

*Equal contribution
1FAIR at Meta   2CERMICS, ENPC, Institut Polytechnique de Paris, CNRS   3Computer Lab, University of Cambridge
4University College London   5Miles, LAMSADE, Université Paris-Dauphine-PSL   6Korea Institute for Advanced Study

Abstract

Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development where code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, to equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements by scaling our approach, synthesizing dozens of valid invariants and dispatching dozens of subgoals, resulting in hundreds of lines of verified code, overcoming plateaus reported in previous works. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path to automated construction of large-scale datasets of verified imperative code.

Results at a Glance

Verina

74.1%
140 / 189 problems solved
128 proved + 12 disproved
Claude Opus 4.5 · 32 turns × 16 agents

Clever-Loom

62.1%
100 / 161 problems solved
Claude Opus 4.5 · 32 turns × 16 agents

Overview

WybeCoder combines SMT-based automatic verification with interactive Lean theorem proving in a hybrid loop. The agent generates imperative code in Velvet—a Dafny-like language embedded in Lean 4 via the Loom framework—annotates it with loop invariants, and iteratively refines both code and proofs using compiler feedback.
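For orientation, here is a minimal Velvet method in the same surface syntax as the worked example further down this page. The method name and specification are our own illustrative choices, not a benchmark task.

```lean
-- Illustrative Velvet sketch (syntax modeled on the worked example on
-- this page; the identifiers here are hypothetical).
method sumTo (n : Nat) return (result : Nat)
  ensures 2 * result = n * (n + 1) do
  let mut i : Nat := 0
  let mut acc : Nat := 0
  while i < n
    -- Loop invariant: acc holds the partial sum 0 + 1 + ... + i
    invariant h_inv : 2 * acc = i * (i + 1)
    done_with h_done : n ≤ i
    decreasing h_dec : n - i
  do
    i := i + 1
    acc := acc + i
  return acc

-- Verification conditions are generated automatically; loom_solve
-- attempts to discharge them with CVC5, leaving any remaining goals
-- for interactive Lean proof.
prove_correct sumTo by loom_solve
```

The invariant is preserved (2·(acc + i + 1) = (i + 1)·(i + 2)) and, together with the exit condition n ≤ i, implies the postcondition at i = n.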

WybeCoder: Subgoal decomposition multi-agent system

Figure 1. WybeCoder subgoal decomposition multi-agent system. Starting from a problem specification, the agent generates an implementation and attempts to discharge verification conditions using CVC5. Remaining goals are tackled interactively in Lean, driving iterative implementation refinement or completing the proof.

Two agent strategies are supported:

  - Sequential agents: one agent per attempt iteratively refines code, invariants, and proofs over a fixed turn budget; running k attempts in parallel corresponds to pass@k.
  - Subgoal decomposition: a multi-agent system in which the remaining verification goals are split into subgoals, each dispatched to a dedicated sub-agent.

Code Example

The following implementation of the sum of fourth powers of odd numbers was generated and verified end-to-end by GPT-5 on verina_basic_43. It includes a non-trivial closed-form loop invariant, a separate algebraic identity lemma, and a multi-case proof block dispatched by loom_solve.

method sumOfFourthPowerOfOddNumbers (n : Nat) return (result : Nat)
  ensures 15 * result = n * (2 * n + 1) * (7 + 24 * n^3 - 12 * n^2 - 14 * n) do
  -- Accumulate S_i = sum_{k=0}^{i-1} (2k+1)^4 using a loop
  let mut i : Nat := 0
  let mut acc : Nat := 0
  while i < n
    invariant h_i_le : i ≤ n
    -- Closed-form invariant for the partial sum
    invariant h_acc_closed : 15 * acc = i * (2 * i - 1) * (2 * i + 1) * (12 * i * i - 7)
    done_with h_done : n ≤ i
    decreasing h_dec : n - i
  do
    let odd : Nat := 2 * i + 1
    acc := acc + odd * odd * odd * odd
    i := i + 1
  return acc

lemma mul_factor_identity_nat (n : ℕ) :
    n * (2 * n - 1) * (2 * n + 1) * (12 * n * n - 7) =
      n * (2 * n + 1) * (7 + 24 * n ^ 3 - 12 * n ^ 2 - 14 * n) := by
  sorry -- 82-line proof omitted

prove_correct sumOfFourthPowerOfOddNumbers by
  loom_solve
  case «ensures» => (
    cases i_3
    have hi_eq : i = n := Nat.le_antisymm h_i_le h_done
    have h1 : 15 * acc = n * (2 * n - 1) * (2 * n + 1) * (12 * n * n - 7) := by
      simpa [hi_eq] using h_acc_closed
    exact h1.trans (mul_factor_identity_nat n)
  )

Contributions

  1. Hybrid Software Verification. We combine SMT solvers with interactive Lean proofs for LLM-based verification, and introduce the first benchmarks tailored to hybrid environments.
  2. Inference Scaling Workflows. We study workflows for large-scale inference, identifying subgoal decomposition and goal-directed modification as key components.
  3. Complex Algorithm Verification. Our system verifies challenging algorithms like Heapsort, managing dozens of subgoals and hundreds of lines of code.
  4. Imperativeness Judge. An LLM-based judge enforces genuine imperative implementations, addressing specification and functional leakage.
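To make "specification leakage" concrete: an implementation can trivially satisfy a postcondition by copying its closed form into the body, so nothing imperative is actually verified. A hypothetical sketch (our own illustrative task, not from the benchmarks):

```lean
-- Hypothetical illustration of specification leakage (not a benchmark
-- task): the spec asks for the sum of the first n odd numbers,
-- 1 + 3 + ... + (2n - 1) = n².
method sumOddLeaky (n : Nat) return (result : Nat)
  ensures result = n * n do
  -- Leakage: the closed form from the ensures clause is returned
  -- directly, so no imperative algorithm is verified.
  return n * n

-- The judge instead requires an implementation that earns the
-- postcondition, e.g. a loop adding 2 * i + 1 at step i and
-- carrying the invariant acc = i * i.
```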

Inference Scaling

Performance shows consistent scaling without plateaus across several orders of magnitude of compute budget. We compare sequential agents and subgoal decomposition across frontier models.

Claude Opus 4.5 inference scaling on Verina

(a) Claude Opus 4.5. Sequential agents vs. subgoal decomposition on Verina.

Gemini 3 Pro inference scaling on Verina

(b) Gemini 3 Pro. Subgoal decomposition outperforms sequential agents for Gemini.

Multi-agent model comparison on Verina

Model comparison. Inference scaling for multi-agent system with different models on Verina, using up to 128 subagents.

Optimal sequential vs parallel compute allocation

Sequential vs. parallel compute. Given a maximum budget of C language model calls, the optimal breakdown into k·T ≤ C with k ≤ 32 attempts and T ≤ 32 turns per attempt.

Multi-agent scalability

Multi-agent scalability. Allocating additional compute to a single subgoal decomposition copy continues to improve performance up to ~1200 model calls, after which pass@2 becomes preferable. For sequential agents the crossover occurs much earlier, at ~22 calls.

Detailed Results

Verina (189 problems)

Budget is reported as turn limit × number of agents launched per problem. For the sequential agent, k agents in parallel corresponds to pass@k. Solve rate combines the prove rate and the disprove rate; subgoal decomposition does not attempt disproofs.

| Method           | Model                 | Turns × Agents | Solve Rate |
|------------------|-----------------------|----------------|------------|
| Baseline         | DS Prover V2 7B       | 64 × 1         | 20.0%      |
| Sequential Agent | GPT-OSS-120B          | 16 × 4         | 30.2%      |
|                  | Gemini 3 Pro          | 32 × 16        | 55.6%      |
|                  | Claude Sonnet 4.5     | 32 × 16        | 63.3%      |
|                  | GPT-5                 | 32 × 16        | 64.6%      |
|                  | Claude Opus 4.5       | 32 × 16        | 74.1%      |
| Subgoal Decomp.  | Claude Sonnet 4.5     | 8 × 128        | 51.9%      |
|                  | GPT-5                 | 8 × 128        | 57.7%      |
|                  | Gemini 3 Pro          | 8 × 128        | 57.7%      |
|                  | Claude Opus 4.5       | 8 × 128        | 66.7%      |
|                  | GPT-5 + Goedel Prover | 8 × 512        | 40.7%      |

Clever-Loom (161 problems)

| Method           | Model               | Turns × Agents | Solve Rate |
|------------------|---------------------|----------------|------------|
| Baseline         | COPRA (Claude 3.7)  | 600 s          | 8.7%       |
| Sequential Agent | Gemini 3 Pro        | 32 × 16        | 32.8%      |
|                  | GPT-5               | 32 × 16        | 53.8%      |
|                  | Claude Sonnet 4.5   | 32 × 16        | 59.6%      |
|                  | Claude Opus 4.5     | 32 × 16        | 62.1%      |
| Subgoal Decomp.  | Gemini 3 Pro        | 8 × 128        | 34.2%      |
|                  | Claude Sonnet 4.5   | 8 × 128        | 41.0%      |
|                  | GPT-5               | 8 × 128        | 46.6%      |
|                  | Claude Opus 4.5     | 8 × 128        | 57.8%      |

Sorting Algorithm Verification

Sorting algorithms sit at the frontier of what is feasible at moderate compute budgets. The successful verification of Heapsort using 357 sub-agents demonstrates the scalability of our approach: Reynolds (1981) gives a by-hand proof in a section starred to mark its difficulty, and later formal treatments have sometimes occupied an entire paper.

| Algorithm             | Sub-procedure | Verified |
|-----------------------|---------------|----------|
| Selection Sort        | —             |          |
| Bubble Sort           | —             |          |
| Insertion Sort        | —             |          |
| Binary Insertion Sort | —             |          |
| Recursive Quicksort   | —             |          |
| Quicksort             | Partition     |          |
|                       | Step          |          |
|                       | Sort          |          |
| Heapsort              | Heapify       |          |
|                       | Maxheap       |          |
|                       | Sort          |          |
| Mergesort             | Mergeruns     |          |
|                       | Mergepass     |          |
|                       | Sort          |          |
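What "verified" means for these routines is full functional correctness: the output is sorted and is a permutation of the input. As an illustrative sketch (our hypothetical phrasing, not the exact benchmark specification), a top-level sorting contract in Velvet might read:

```lean
-- Illustrative sketch of a functional-correctness contract for sorting
-- (hypothetical phrasing, not the exact benchmark specification).
method heapsort (a : Array Nat) return (result : Array Nat)
  -- Sortedness: elements appear in non-decreasing order,
  -- and permutation: the output is a rearrangement of the input.
  ensures (∀ i j, i < j → j < result.size → result[i]! ≤ result[j]!)
    ∧ result.toList.Perm a.toList do
  -- body elided: the verified implementation builds a max-heap, then
  -- repeatedly swaps the maximum to the end and restores the heap
  return a -- placeholder standing in for the real imperative body
```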

BibTeX

@article{gloeckle2026wybecoder,
  title   = {WybeCoder: Verified Imperative Code Generation},
  author  = {Gloeckle, Fabian and Bak{\v{s}}ys, Mantas and Feher, Darius and Zheng, Kunhao and Hayat, Amaury and Holden, Sean B. and Synnaeve, Gabriel and O'Hearn, Peter},
  journal = {Preprint},
  year    = {2026}
}