*Equal contribution
1FAIR at Meta
2CERMICS, ENPC, Institut Polytechnique de Paris, CNRS
3Computer Lab, University of Cambridge
4University College London
5Miles, LAMSADE, Université Paris-Dauphine-PSL
6Korea Institute for Advanced Study
Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development, in which code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, into equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements from scaling our approach: synthesizing dozens of valid invariants and dispatching dozens of subgoals yields hundreds of lines of verified code, overcoming the plateaus reported in previous work. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving the way toward automated construction of large-scale datasets of verified imperative code.
WybeCoder combines SMT-based automatic verification with interactive Lean theorem proving in a hybrid loop. The agent generates imperative code in Velvet—a Dafny-like language embedded in Lean 4 via the Loom framework—annotates it with loop invariants, and iteratively refines both code and proofs using compiler feedback.
Figure 1. WybeCoder subgoal decomposition multi-agent system. Starting from a problem specification, the agent generates an implementation and attempts to discharge verification conditions using CVC5. Remaining goals are tackled interactively in Lean, driving iterative implementation refinement or completing the proof.
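The loop in Figure 1 can be sketched in a few lines. This is a minimal illustration, not the actual WybeCoder API: `generate`, `smt_discharge`, and `lean_prove` are hypothetical stand-ins for the LLM call, the CVC5 pass over verification conditions, and the interactive Lean step.

```python
def hybrid_verify(spec, generate, smt_discharge, lean_prove, max_turns=32):
    # One sequential-agent attempt: draft annotated code, discharge
    # what the SMT solver can, send the rest to Lean, refine on failure.
    # All three callables are illustrative stand-ins.
    code = generate(spec, attempt=None, failures=None)
    for _ in range(max_turns):
        goals = smt_discharge(code)            # VCs the SMT pass left open
        unsolved = [g for g in goals if not lean_prove(g)]
        if not unsolved:
            return code                        # all obligations discharged
        code = generate(spec, attempt=code, failures=unsolved)
    return None                                # turn budget exhausted
```

Each turn feeds the still-open goals back to the model, so the implementation, its invariants, and the proofs are refined together rather than in separate phases.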
Two agent strategies are supported:
- **Sequential Agent**: a single agent iteratively refines the implementation and its annotations over multiple turns using compiler and solver feedback; launching k independent copies corresponds to pass@k.
- **Subgoal Decomposition**: a lead agent decomposes the remaining proof obligations into subgoals and dispatches them to parallel sub-agents, whose sub-proofs are assembled into a complete verification.
The following implementation of the sum of fourth powers of odd numbers was generated and verified end-to-end by GPT-5 on verina_basic_43. It includes a non-trivial closed-form loop invariant, a separate algebraic identity lemma, and a multi-case proof block dispatched by loom_solve.
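The Velvet/Lean artifact itself is not reproduced here. As a plain-language illustration, the following Python sketch checks the kind of closed-form invariant involved at every loop iteration, assuming the task sums the fourth powers of the first n odd numbers (the exact verina_basic_43 specification may differ):

```python
def sum_fourth_powers_of_odds(n):
    # Sum (2k-1)^4 for k = 1..n, asserting at each iteration the
    # closed-form loop invariant 15*s = i(2i-1)(2i+1)(12i^2 - 7),
    # stated multiplied through by 15 to stay in integer arithmetic.
    s = 0
    for i in range(1, n + 1):
        s += (2 * i - 1) ** 4
        assert 15 * s == i * (2 * i - 1) * (2 * i + 1) * (12 * i * i - 7)
    return s
```

The algebraic identity behind the assertion plays the role of the separate identity lemma mentioned above: it is proved once and then invoked at each loop step.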
Performance shows consistent scaling without plateaus across several orders of magnitude of compute budget. We compare sequential agents and subgoal decomposition across frontier models.
(a) Claude 4.5 Opus. Sequential agents vs. subgoal decomposition on Verina.
(b) Gemini 3 Pro. Subgoal decomposition outperforms sequential agents for Gemini.
Model comparison. Inference scaling for the multi-agent system with different models on Verina, using up to 128 sub-agents.
Sequential vs. parallel compute. Given a maximum budget of C language-model calls, we search for the optimal breakdown k·T ≤ C into k ≤ 32 parallel attempts of T ≤ 32 turns each.
Multi-agent scalability. Allocating additional compute to a single subgoal-decomposition run continues to improve performance up to ~1200 model calls, after which two independent runs (pass@2) become preferable. For sequential agents the crossover occurs much earlier, at ~22 calls.
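Where pass@k is reported, it can be computed with the standard unbiased estimator over n sampled attempts with c successes, 1 − C(n−c, k)/C(n, k), sketched here:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator of pass@k from n independent attempts,
    # c of which succeeded: 1 - C(n-c, k) / C(n, k).
    # If fewer than k attempts failed, every size-k subset contains
    # a success, so the probability is exactly 1.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 attempts and 1 success, `pass_at_k(2, 1, 1)` gives 0.5.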
Budget is reported as turns limit × number of agents launched per problem. For the Sequential Agent, k agents in parallel corresponds to pass@k. Solve rate combines the prove rate and the disprove rate; subgoal decomposition does not attempt disproofs.
Results on Verina.

| Method | Model | Turns × Agents | Solve Rate |
|---|---|---|---|
| Baseline | DS Prover V2 7B | 64 × 1 | 20.0% |
| Sequential Agent | GPT-OSS-120B | 16 × 4 | 30.2% |
| | Gemini 3 Pro | 32 × 16 | 55.6% |
| | Claude 4.5 Sonnet | 32 × 16 | 63.3% |
| | GPT-5 | 32 × 16 | 64.6% |
| | Claude 4.5 Opus | 32 × 16 | 74.1% |
| Subgoal Decomp. | Claude 4.5 Sonnet | 8 × 128 | 51.9% |
| | GPT-5 | 8 × 128 | 57.7% |
| | Gemini 3 Pro | 8 × 128 | 57.7% |
| | Claude 4.5 Opus | 8 × 128 | 66.7% |
| | GPT-5 + Goedel Prover | 8 × 512 | 40.7% |
Results on Clever.

| Method | Model | Turns × Agents | Solve Rate |
|---|---|---|---|
| Baseline | COPRA (Claude 3.7) | 600 s (time budget) | 8.7% |
| Sequential Agent | Gemini 3 Pro | 32 × 16 | 32.8% |
| | GPT-5 | 32 × 16 | 53.8% |
| | Claude 4.5 Sonnet | 32 × 16 | 59.6% |
| | Claude 4.5 Opus | 32 × 16 | 62.1% |
| Subgoal Decomp. | Gemini 3 Pro | 8 × 128 | 34.2% |
| | Claude 4.5 Sonnet | 8 × 128 | 41.0% |
| | GPT-5 | 8 × 128 | 46.6% |
| | Claude 4.5 Opus | 8 × 128 | 57.8% |
Sorting algorithms sit at the frontier of what is feasible at moderate compute budgets. The successful verification of Heapsort, using 357 sub-agents, demonstrates the scalability of our approach. Reynolds (1981) gives a by-hand proof in a section starred to connote difficulty, and later formal treatments have sometimes taken up an entire paper.
| Algorithm | Sub-procedure | Verified |
|---|---|---|
| Selection Sort | — | ✓ |
| Bubble Sort | — | ✓ |
| Insertion Sort | — | ✓ |
| Binary Insertion Sort | — | ✓ |
| Recursive Quicksort | — | ✗ |
| Quicksort | Partition | ✓ |
| | Step | ✗ |
| | Sort | ✓ |
| Heapsort | Heapify | ✓ |
| | Maxheap | ✓ |
| | Sort | ✓ |
| Mergesort | Mergeruns | ✓ |
| | Mergepass | ✗ |
| | Sort | ✓ |
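To give a concrete sense of the Heapify/Maxheap/Sort decomposition, here is an unverified Python sketch of heapsort in which runtime assertions stand in for the formal max-heap invariant (this is illustrative only, not the Velvet artifact):

```python
def heapsort(a):
    # In-place heapsort. The two assertions check the max-heap
    # property that the formal development must prove is preserved.
    n = len(a)

    def sift_down(i, size):
        # Heapify: restore the heap property below node i.
        while 2 * i + 1 < size:
            c = 2 * i + 1
            if c + 1 < size and a[c + 1] > a[c]:
                c += 1
            if a[i] >= a[c]:
                break
            a[i], a[c] = a[c], a[i]
            i = c

    def is_max_heap(size):
        return all(a[(j - 1) // 2] >= a[j] for j in range(1, size))

    # Maxheap: build the heap bottom-up.
    for i in range(n // 2 - 1, -1, -1):
        sift_down(i, n)
    assert is_max_heap(n)
    # Sort: repeatedly move the maximum into the sorted suffix.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)
        assert is_max_heap(end)
    return a
```

The nested structure mirrors the sub-procedures in the table above; in the verified setting, each assertion becomes a loop invariant whose preservation generates the verification conditions the agents must discharge.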