H
Howardismvol. 03 · quiet corner of the web
Plate IIHarnessHOWARDISM

Agentic Loops Overtake Bespoke Systems

PublishedMay 23, 2026FiledConceptTopicHarnessTagsAgent EngineeringAI For MathematicsLLM ArchitectureReading5 minSourceAI-synthesised

DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter lesson / harness-shrinkage confirmed in formal math

Illustration for Agentic Loops Overtake Bespoke Systems

Sources#

Summary#

The headline empirical finding of DeepMind's AI-Driven Formal Proof Search paper, and its clearest cross-domain confirmation of The Bitter Lesson: a basic agent — independent prover subagents each running a simple "Ralph loop" of generate-edit-compile episodes — solved all 9 of the Erdős problems that the elaborate full-featured agent (evolutionary search + the bespoke AlphaProof RL prover) solved, only at higher cost on the hardest ones. The paper's conclusion: "an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable."

The surprise, in the authors' words#

The team chose the full-featured agent (D) for the large-scale exploration based on its strong competition-benchmark performance — at planning time, "simpler agentic loops did not show strong performance." Then a post-hoc analysis on the 9 solved Erdős problems found:

"Remarkably, the basic agent solved all 9 problems, though at a higher cost on the harder problems."

They attribute this to two things: (1) the LLM landscape "shifted substantially" between planning and analysis (model capability jumped), and (2) "the power of compiler feedback in grounding LLM reasoning" — the simple loop works because Lean's verifier keeps each step honest.

Why this is the bitter lesson in a new field#

The Bitter Lesson: scaled general methods beat hand-engineered structure over time. Here the "hand-engineered structure" is the bespoke apparatus — the AlphaProof RL theorem-prover (a specialized trained system) and the evolutionary population/Elo machinery (Evolutionary Proof Search). The "scaled general method" is a frontier LLM in a plain loop with a verifier. As the LLM improved, the bespoke scaffolding's advantage collapsed to a cost difference, not a capability difference on most problems. This is the exact dynamic Harness Shrinkage as Models Improve describes — scaffolding that compensates for model weakness becomes drag as the model strengthens — observed in formal mathematics rather than in coding harnesses.

The residual advantage (and its expiry date)#

The bespoke agent isn't useless — it "retains an advantage on the hardest problems for now," with 2×–5× cost savings on the two toughest Erdős problems (#125, #138). But the authors explicitly date the advantage: "as LLM capabilities grow, this advantage may diminish." The pattern is a receding frontier where bespoke systems matter: the harder the problem and the weaker the model, the more the specialized structure pays off — and that region shrinks every model release. (Standalone AlphaProof tree-search and smaller-model basic agents solved nothing — so the loop still needs a strong-enough model + a verifier; see Scale-Dependent Prompt Sensitivity.)

Generalization#

The transferable claim: when a domain has a cheap, reliable verifier, prefer the simplest agentic loop that can exploit it, and re-evaluate bespoke scaffolding every model release — it tends to convert from capability-enabling to merely cost-saving, then to drag. The verifier (The Verifiability Thesis) is what makes the simple loop viable; the loop is what makes the system cheap to build and maintain. Bespoke systems earn their keep only at the receding hard frontier.

Connections#

Open Questions#

  • The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on any problems, or fully collapse to a cost line?
  • Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?

Sources#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 9
  • Agent Loop Pattern

    `/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…

  • AI-Driven Formal Proof Search

    LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…

  • AlphaProof Nexus

    DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…

  • Client-Side Agent Optimization

    AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…

  • Evolutionary Proof Search

    The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…

  • Google DeepMind

    Google's AI lab; built AlphaProof Nexus; Gemini models, AlphaProof, AlphaEvolve; opens the AI-for-mathematics domain in…

  • Harness Shrinkage as Models Improve

    Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…

  • Lean

    Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…

  • The Bitter Lesson

    Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…

Related articles
  • AI-Driven Formal Proof Search

    LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…

  • Agent Harness Engineering

    Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…

  • AlphaProof Nexus

    DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…

  • Evolutionary Proof Search

    The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…

  • The Verifiability Thesis

    LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…