
AI-Driven Formal Proof Search

Published: May 23, 2026 · Filed: Concept · Domain: Formal Math · Tags: AI For Mathematics, Formal Methods, Agent Engineering · Reading: 7 min · Source: AI-synthesised

An LLM generates Lean and the compiler verifies every step, so hallucinated steps are rejected rather than trusted; DeepMind resolves 9/353 Erdős and 44/492 OEIS open problems; verification acts as a filter for human review


Sources

Advancing Mathematics Research with AI-Driven Formal Proof Search

Summary

AI-driven formal proof search is the paradigm, demonstrated at research scale by Google DeepMind's AlphaProof Nexus (arXiv 2605.22763), of having LLMs generate proofs in a formal language (Lean) whose compiler mechanically verifies every logical step, and then searching for a complete proof in a generate-and-verify loop. This converts the LLM's biggest liability for mathematics, namely hallucinated or subtly wrong natural-language proofs that need expensive expert review, into a checkable artifact: a proof counts as correct with respect to its formal statement if and only if Lean accepts it with no sorry and no disallowed axioms. The paper reports the first large-scale evaluation on open research problems, autonomously resolving 9 of 353 attempted Erdős problems and 44 of 492 OEIS conjectures, among other results.

Why formal, not natural language

LLM natural-language proofs "contain subtle logical errors or hallucinations," and mistakes in unreviewed intermediate steps cascade, capping the complexity of what you can delegate. Formal languages fix this: in Lean, "definitions, theorems, and proofs are all mechanically verified code." The key reframing in the paper's discussion:

Formal verification can serve as a filter for determining which proofs merit human review.

So AI-driven formal proof search doesn't replace mathematicians; it triages. Experts review only what compiled and, within that, focus on the structure rather than re-verifying every line. This is Karpathy's verifiability thesis in its purest form: math plus Lean is the maximally verifiable domain, and the compiler is the reward signal.

The proof-sketch interface

The unit of work is a proof sketch: a Lean file containing the target theorem, its dependencies (definitions, imports), and sorry in place of the proof. User-provided markers bound what the agent may edit: EVOLVE-BLOCK regions, where it may introduce helper lemmas, definitions, or proof steps, and EVOLVE-VALUE, where it may change a parameter expression. The agent succeeds when it emits a sorry-free proof that SafeVerify accepts, meaning the file compiles and no axiom such as sorryAx has been injected. Optionally, the mathematician supplies natural-language context and domain knowledge encoded in Lean. (See AlphaProof Nexus for the agent architectures that drive this loop.)
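
As a rough illustration of what such a sketch might look like, here is a toy Lean 4 file, assuming Mathlib is available. The theorem, its names, and the exact spelling of the marker comments are assumptions made for this example; the source names the markers but does not show their syntax. The `#print axioms` command is a standard Lean command and gives a feel for the axiom check, although SafeVerify itself may do more than this.

```lean
import Mathlib

-- EVOLVE-BLOCK-START   (marker spelling is an assumption)
-- The agent may add helper lemmas or definitions in this region.
-- EVOLVE-BLOCK-END

-- Toy target theorem (hypothetical, not from the paper).
theorem toy_target (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  -- EVOLVE-BLOCK-START
  sorry
  -- EVOLVE-BLOCK-END

-- While the proof still contains `sorry`, the report lists `sorryAx`.
#print axioms toy_target

-- One way the agent could close the goal, shown under a second name
-- so that both versions can live in the same file.
theorem toy_target_done (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  exact add_nonneg (sq_nonneg a) (sq_nonneg b)

-- No `sorryAx` should appear in this report.
#print axioms toy_target_done
```

The statement of the theorem sits outside the editable regions, which is the point of the markers: the agent can change how the result is proved, but not what is being claimed.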

Compiler feedback as grounding

The engine is the tight loop between generation and verification: the subagent edits via a search-replace tool, Lean compiles after each edit, and Lean's error message directs the next turn. The paper attributes the surprising strength of even its basic agent partly to "the power of compiler feedback in grounding LLM reasoning" (Agentic Loops Overtake Bespoke Systems). The verifier isn't just a final gate — it's a per-step teacher that keeps the model's reasoning anchored to ground truth.
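
The loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: `proof_loop`, `apply_edit`, `toy_compile`, and `scripted_propose` are hypothetical names, the "compiler" is a string-matching stand-in for Lean, and the "LLM" is a scripted function. Only the shape of the loop (edit, compile, feed the error back) comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CompileResult:
    ok: bool      # True when the checker reports no errors
    message: str  # error text fed back to the generator on the next turn


Edit = Tuple[str, str]  # (search text, replacement text)


def apply_edit(source: str, edit: Edit) -> str:
    """Apply a search-replace edit; the search text must match exactly once."""
    search, replace = edit
    hits = source.count(search)
    if hits != 1:
        raise ValueError(f"search text matched {hits} times, expected 1")
    return source.replace(search, replace)


def proof_loop(
    sketch: str,
    propose: Callable[[str, str], Edit],
    compile_fn: Callable[[str], CompileResult],
    max_turns: int = 8,
) -> Optional[str]:
    """Edit, compile, and feed the error back until the proof is accepted."""
    source, feedback = sketch, ""
    for _ in range(max_turns):
        edit = propose(source, feedback)
        try:
            candidate = apply_edit(source, edit)
        except ValueError as err:
            feedback = f"edit rejected: {err}"
            continue
        result = compile_fn(candidate)
        if result.ok and "sorry" not in candidate:
            return candidate  # accepted: compiles and is sorry-free
        source, feedback = candidate, result.message
    return None  # budget exhausted without an accepted proof


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_compile(source: str) -> CompileResult:
    """Stand-in for the Lean compiler; messages are not real Lean output."""
    if "sorry" in source:
        return CompileResult(False, "toy checker: proof contains sorry")
    if "add_nonneg" not in source:
        return CompileResult(False, "toy checker: proof incomplete")
    return CompileResult(True, "")


def scripted_propose(source: str, feedback: str) -> Edit:
    """Stand-in for the LLM: a first attempt, then a fix after feedback."""
    if "sorry" in source:
        return ("sorry", "exact sq_nonneg a")
    return ("exact sq_nonneg a", "exact add_nonneg (sq_nonneg a) (sq_nonneg b)")


SKETCH = "theorem t (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by\n  sorry"

if __name__ == "__main__":
    print(proof_loop(SKETCH, scripted_propose, toy_compile))
```

In this toy run the first edit is rejected by the checker and the second is accepted, so the verifier acts on every turn rather than only at the end. A real harness would replace `toy_compile` with a call to Lean and an axiom check.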

Results (open research problems)

Erdős problems: 9/353 from the Formal Conjectures repo, including questions open since 1970/1996 and two open for about 56 years; logged on Terence Tao's wiki of AI contributions to Erdős problems. Techniques span CRT + 3-AP-avoiding-set constructions (#12), inductive thinning exploiting the Diophantine approximation $3^m\approx 4^k$ (#125), etc.
OEIS: 44/492 open conjectures (with "test lemmas" verifying the first few sequence terms as a misformalization guard).
Algebraic geometry: a ~15-year-open question on log-concavity of pure $O$-sequences (codim 3, type 2).
Convex optimization: an exact $\mathcal{O}(1/t)$ rate for Anchored GDA — discovering a novel parameter schedule by marking the learning schedule as an EVOLVE-VALUE (proof and schedule searched jointly).
Additive combinatorics: helped resolve #57 from Ben Green's list (formalized a candidate counterexample, agent proved it disproves the conjecture).
Quantum optics (with Mario Krenn): monochromatic quantum-graph / high-dim GHZ-state existence for $N=d\in\{4,6,10\}$.
Graph theory: a bipartite variant of the reconstruction conjecture; a 1996 conjecture from the Graffiti auto-conjecturing system (pointing toward an AI-conjecture→AI-proof loop).
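
The "test lemmas" mentioned for the OEIS conjectures can be pictured with a toy Lean 4 example. The sequence below (pronic numbers $n(n+1)$) and the names `a` and `aShifted` are assumptions chosen for illustration, not taken from the paper; the idea is only that a formal definition is checked against the first few published terms before any proof effort is spent on it.

```lean
-- Hypothetical formalization of a sequence: a(n) = n * (n + 1).
def a (n : Nat) : Nat := n * (n + 1)

-- Test lemmas: the first few terms must match the listed values 0, 2, 6, 12.
example : a 0 = 0 := by decide
example : a 1 = 2 := by decide
example : a 2 = 6 := by decide
example : a 3 = 12 := by decide

-- An off-by-one misformalization passes the type checker as a definition,
-- but the analogous test lemma `aShifted 0 = 0` would fail, since its
-- first term is 2.
def aShifted (n : Nat) : Nat := (n + 1) * (n + 2)
example : aShifted 0 = 2 := by decide
```

Such checks are cheap compared with a proof search, and they catch indexing and offset mistakes that would otherwise make a verified proof answer the wrong question.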

Misformalization detection — an unexpected payoff

Because the agent reasons against the formal statement, it surfaces errors in how problems were formalized. Examples: it found proofs by reading "density" as natural density, prompting corrections to "lower density" (#125) and "upper density" (#741(i)); it identified misformalizations in the literature. Failure modes also justify the formality: top sketches sometimes offloaded the core difficulty into a single sorry in a helper lemma restating the target, or cited "established" lemmas that were hallucinations — both caught precisely because end-to-end formal verification refuses to accept them.
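
For reference, the density distinction above rests on the following standard definitions (textbook definitions, not quoted from the paper). For a set $A\subseteq\mathbb{N}$, the lower and upper densities are

$$\underline{d}(A)=\liminf_{n\to\infty}\frac{|A\cap\{1,\dots,n\}|}{n},\qquad \overline{d}(A)=\limsup_{n\to\infty}\frac{|A\cap\{1,\dots,n\}|}{n}.$$

These always satisfy $\underline{d}(A)\le\overline{d}(A)$, and the natural density $d(A)$ exists exactly when the two agree, in which case $d(A)=\underline{d}(A)=\overline{d}(A)$. Reading "density" as natural density therefore adds the assumption that the limit exists, so the formal statement can differ in strength from the intended lower- or upper-density version.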

Deepening human understanding

The paper's stance: "the future of mathematics lies in human–machine partnership." Collaborators found that proof attempts enhanced their understanding even when the agent failed — formal sketches let experts focus on the unresolved subgoals rather than re-verifying the whole argument. This is Outsource Your Thinking, Not Your Understanding realized: the AI does the search; the mathematician's understanding is sharpened, not bypassed.

Connections

Agent Harness Engineering — EVOLVE-BLOCK enforces invariants-not-implementations, a harness-engineering pattern
Verification as the New Bottleneck — compiler-verified proofs are the purest case of verification as the gating step
AlphaProof Nexus — the framework and agent architectures that implement the paradigm
Lean — the proof assistant whose compiler provides the verification/grounding
The Verifiability Thesis — math+Lean is the maximally-verifiable domain; the compiler is the reward signal
Agentic Loops Overtake Bespoke Systems — the headline finding: a simple loop matched the bespoke system as LLMs improved
Evolutionary Proof Search — the full-featured agent's population/Elo search mechanism
Agent Loop Pattern — the basic prover subagent is literally a "Ralph loop" (huntley2025ralph)
Outsource Your Thinking, Not Your Understanding — formal sketches deepen mathematician understanding even on unsolved problems
Client-Side Agent Optimization — solve-rate-vs-cost Pareto curves across agents (A/B/C/D) are the same cost/quality framing AgentOpt formalizes
Scale-Dependent Prompt Sensitivity — smaller Gemini models solved nothing; capability is sharply scale-gated here (a hard threshold, not a smooth curve)
Jagged Intelligence (Ghosts, Not Animals) — hallucinated "literature" lemmas are jaggedness; formal verification is the filter that catches it
Autonomous Scientific Discovery — the wet-lab/life-sciences sibling: AI doing novel research without a Lean-style instant verifier, so the (slow, costly) experiment is the reward signal rather than a compiler
Intelligence Explosion Dynamics — FunSearch/AlphaEvolve-style LLM-guided program search is concrete algorithmic self-improvement: AI finding novel constructions beyond its training distribution
Transformative Creativity — DeepMind's report places new theorem-proving at Boden levels 1–2 (exploratory creativity within Lean's formal conceptual space)
Deep Research Agents — the no-instant-verifier sibling in open-domain research: factual accuracy is its weakest axis precisely because there's no Lean-style compiler to ground each claim
LLM-as-a-Judge — what open-domain research must fall back on absent a sound verifier; the contrast with the compiler's total verification here
Optimizer–Evaluator Decoupling — the compiler is the limit case of that rule: an evaluator not merely independent of the prover but sound, which is why proof loops can run fully autonomous while agent eval-fix loops stay human-gated

Open Questions

Successes cluster where Lean's mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing new theory?
The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
The Graffiti result hints at closing the loop between AI conjecturing and AI proving. What does an end-to-end conjecture→formalize→prove pipeline look like?




