
AI-Driven Formal Proof Search

Published: May 23, 2026 · Filed: Concept · Topic: Harness · Tags: AI For Mathematics, Formal Methods, Agent Engineering · Reading: 6 min · Source: AI-synthesised

LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEIS open problems; verification as a filter for human review

Summary

AI-driven formal proof search is the paradigm, demonstrated at research scale by Google DeepMind's AlphaProof Nexus (arXiv 2605.22763), of using LLMs to generate proofs in a formal language (Lean) whose compiler mechanically verifies every logical step, then searching for a complete proof in a generate-and-verify loop. This converts the LLM's biggest liability for mathematics, hallucinated or subtly wrong natural-language proofs that need expensive expert review, into a checkable artifact: a proof counts as correct if and only if Lean accepts it with no sorry and no disallowed axioms. The paper reports the first large-scale evaluation on open research problems, autonomously resolving 9/353 attempted Erdős problems and 44/492 OEIS conjectures, among other results.

Why formal, not natural language

LLM natural-language proofs "contain subtle logical errors or hallucinations," and mistakes in unreviewed intermediate steps cascade, capping the complexity of what you can delegate. Formal languages fix this: in Lean, "definitions, theorems, and proofs are all mechanically verified code." The key reframing in the paper's discussion:

Formal verification can serve as a filter for determining which proofs merit human review.

So AI-driven formal proof search doesn't replace mathematicians; it triages. Experts review only what compiled and, within that, focus on the structure rather than re-verifying every line. This is Karpathy's verifiability thesis in its purest form: math plus Lean is the maximally verifiable domain, and the compiler is the reward signal.
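The triage idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's tooling: the `Candidate` record, the `triage` helper, and the attempt names are hypothetical. The allowed set is Lean 4's three standard axioms; whether the paper's verifier permits exactly that set is an assumption.

```python
from dataclasses import dataclass, field

# Lean 4's three standard axioms. Treating exactly this set as "allowed"
# is an assumption made for the sketch.
ALLOWED_AXIOMS = frozenset({"propext", "Classical.choice", "Quot.sound"})


@dataclass
class Candidate:
    """Hypothetical record of one proof attempt and its verifier report."""
    name: str
    compiled: bool
    axioms: frozenset = field(default_factory=frozenset)


def merits_review(candidate: Candidate) -> bool:
    # A candidate reaches a human only if it compiled and relies on
    # nothing outside the allowed axioms (so no sorryAx).
    return candidate.compiled and candidate.axioms <= ALLOWED_AXIOMS


def triage(candidates: list) -> list:
    """Return the names of the candidates worth an expert's time."""
    return [c.name for c in candidates if merits_review(c)]


candidates = [
    Candidate("attempt_a", True, frozenset({"propext"})),
    Candidate("attempt_b", True, frozenset({"sorryAx"})),  # hidden sorry
    Candidate("attempt_c", False),                         # did not compile
    Candidate("attempt_d", True),                          # axiom-free
]
review_queue = triage(candidates)
print(review_queue)
```

Only `attempt_a` and `attempt_d` survive the filter; the attempt that smuggled in `sorryAx` and the one that failed to compile never reach a reviewer.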

The proof-sketch interface

The unit of work is a proof sketch: a Lean file with the target theorem, its dependencies (definitions, imports), and sorry in place of the proof. User-provided markers bound what the agent may edit: EVOLVE-BLOCK (introduce helper lemmas, definitions, or proof steps) and EVOLVE-VALUE (change parameter expressions). The agent succeeds when it emits a sorry-free proof that SafeVerify accepts, meaning the file compiles and no axioms such as sorryAx have been injected. Optionally the mathematician supplies natural-language context and domain knowledge encoded in Lean. (See AlphaProof Nexus for the agent architectures that drive this loop.)
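A proof sketch of this shape might look like the following Lean file. It is a toy illustration only: the theorem is deliberately trivial, and the comment syntax used for the EVOLVE-BLOCK markers is an assumption, since the exact marker format is not given here. The EVOLVE-VALUE marker is not shown.

```lean
import Mathlib

-- EVOLVE-BLOCK-START  (hypothetical marker syntax)
-- The agent may introduce helper lemmas or definitions here.
-- EVOLVE-BLOCK-END

/-- Target theorem. The statement sits outside the markers, so it is fixed. -/
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  -- EVOLVE-BLOCK-START
  sorry
  -- EVOLVE-BLOCK-END

-- While `sorry` is present, this reports a dependency on `sorryAx`.
#print axioms sum_sq_nonneg
```

Replacing `sorry` with `exact add_nonneg (sq_nonneg a) (sq_nonneg b)` closes the goal using two Mathlib lemmas, after which `#print axioms` no longer lists `sorryAx`.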

Compiler feedback as grounding

The engine is the tight loop between generation and verification: the subagent edits via a search-replace tool, Lean compiles after each edit, and Lean's error message directs the next turn. The paper attributes the surprising strength of even its basic agent partly to "the power of compiler feedback in grounding LLM reasoning" (Agentic Loops Overtake Bespoke Systems). The verifier isn't just a final gate — it's a per-step teacher that keeps the model's reasoning anchored to ground truth.
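The shape of that loop can be sketched in Python under strong simplifications. Everything here is hypothetical scaffolding: `toy_check` stands in for the Lean compiler with plain string matching, `toy_propose` stands in for the LLM with a scripted two-step policy, and `proof_search` is an illustrative name rather than anything from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CheckResult:
    ok: bool
    message: str  # compiler-style feedback passed to the next turn


Edit = Tuple[str, str]  # (text to search for, replacement text)


def proof_search(
    sketch: str,
    propose: Callable[[str, str], Edit],
    check: Callable[[str], CheckResult],
    max_turns: int = 8,
) -> Tuple[Optional[str], int]:
    """Apply one search-replace edit per turn and verify after each edit."""
    current, feedback = sketch, "initial sketch contains sorry"
    for turn in range(1, max_turns + 1):
        old, new = propose(current, feedback)
        if old not in current:
            feedback = f"search text not found: {old!r}"
            continue
        current = current.replace(old, new, 1)
        result = check(current)
        if result.ok:
            return current, turn
        feedback = result.message  # the error directs the next edit
    return None, max_turns


def toy_check(source: str) -> CheckResult:
    # Stand-in for the Lean compiler: string matching only.
    if "sorry" in source:
        return CheckResult(False, "declaration uses 'sorry'")
    if "wrong_lemma" in source:
        return CheckResult(False, "unknown identifier 'wrong_lemma'")
    return CheckResult(True, "")


def toy_propose(source: str, feedback: str) -> Edit:
    # Stand-in for the LLM: first cites a lemma that does not exist,
    # then repairs the proof once the error message points at it.
    if "unknown identifier" in feedback:
        return ("exact wrong_lemma",
                "exact add_nonneg (sq_nonneg a) (sq_nonneg b)")
    return ("sorry", "exact wrong_lemma")


SKETCH = "theorem t (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by\n  sorry"
proof, turns = proof_search(SKETCH, toy_propose, toy_check)
print(turns)
print(proof)
```

In this toy run the first edit cites a nonexistent lemma, the checker's error message comes back as feedback, and the second edit repairs it, so the search finishes on turn 2. A real system would swap in an actual Lean build for `toy_check` and a model call for `toy_propose`.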

Results (open research problems)

  • Erdős problems: 9/353 from the Formal Conjectures repo, including questions open since 1970 and 1996 and two open for roughly 56 years; logged on Terence Tao's wiki of AI contributions to Erdős problems. Techniques span Chinese Remainder Theorem (CRT) plus 3-AP-avoiding-set constructions (#12), inductive thinning exploiting the Diophantine approximation $3^m\approx 4^k$ (#125), and others.
  • OEIS: 44/492 open conjectures (with "test lemmas" verifying the first few sequence terms as a misformalization guard).
  • Algebraic geometry: a ~15-year-open question on log-concavity of pure $O$-sequences (codim 3, type 2).
  • Convex optimization: an exact $\mathcal{O}(1/t)$ rate for Anchored GDA — discovering a novel parameter schedule by marking the learning schedule as an EVOLVE-VALUE (proof and schedule searched jointly).
  • Additive combinatorics: helped resolve #57 from Ben Green's list (formalized a candidate counterexample, agent proved it disproves the conjecture).
  • Quantum optics (with Mario Krenn): monochromatic quantum-graph / high-dim GHZ-state existence for $N=d\in\{4,6,10\}$.
  • Graph theory: a bipartite variant of the reconstruction conjecture; a 1996 conjecture from the Graffiti auto-conjecturing system (pointing toward an AI-conjecture→AI-proof loop).
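The "test lemma" guard mentioned for the OEIS results can be illustrated with a toy Lean example. The sequence below (triangular numbers) and the name `a` are stand-ins chosen for the sketch, not one of the conjectures from the paper; the point is only that the formal definition is pinned to known initial terms before any conjecture about it is attempted.

```lean
-- Hypothetical formalization of a sequence: a(n) = n(n+1)/2.
def a (n : Nat) : Nat := n * (n + 1) / 2

-- Test lemmas: the first few terms must match the published values
-- 0, 1, 3, 6, 10. Each one is checked by evaluation.
example : a 0 = 0 := rfl
example : a 1 = 1 := rfl
example : a 2 = 3 := rfl
example : a 3 = 6 := rfl
example : a 4 = 10 := rfl
```

If the definition were off by one in its indexing, for instance `(n + 1) * (n + 2) / 2`, the very first test lemma would fail to compile, flagging the misformalization before any proof effort is spent on the wrong statement.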

Misformalization detection: an unexpected payoff

Because the agent reasons against the formal statement, it surfaces errors in how problems were formalized. Examples: it found proofs by reading "density" as natural density, prompting corrections to "lower density" (#125) and "upper density" (#741(i)); it identified misformalizations in the literature. Failure modes also justify the formality: top sketches sometimes offloaded the core difficulty into a single sorry in a helper lemma restating the target, or cited "established" lemmas that were hallucinations — both caught precisely because end-to-end formal verification refuses to accept them.
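The first failure mode, a helper lemma that restates the target and hides the difficulty behind sorry, is easy to reproduce in miniature. The Lean snippet below is a toy with hypothetical names (`helper`, `target`) and a deliberately trivial statement; it shows why the trick does not survive an end-to-end axiom check.

```lean
-- The "helper" simply restates the target and leaves it unproved.
theorem helper (a b : Nat) : a + b = b + a := by
  sorry

-- The target itself contains no `sorry` in its own proof term.
theorem target (a b : Nat) : a + b = b + a :=
  helper a b

-- The axiom check follows dependencies, so the gap is still found:
-- this reports that `target` depends on `sorryAx`.
#print axioms target
```

Scanning only the final theorem's text for the word sorry would miss this, which is why the acceptance test is phrased in terms of axioms rather than surface syntax.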

Deepening human understanding

The paper's stance: "the future of mathematics lies in human–machine partnership." Collaborators found that proof attempts enhanced their understanding even when the agent failed — formal sketches let experts focus on the unresolved subgoals rather than re-verifying the whole argument. This is Outsource Your Thinking, Not Your Understanding realized: the AI does the search; the mathematician's understanding is sharpened, not bypassed.

Open Questions

  • Successes cluster where Lean's mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing new theory?
  • The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
  • The Graffiti result hints at closing the loop between AI conjecturing and AI proving. What does an end-to-end conjecture→formalize→prove pipeline look like?

