H
Howardism
Plate IIFormal Math機器翻譯 · machine-translatedENHOWARDISM

Agentic Loops 超越 Bespoke Systems

PublishedMay 23, 2026FiledConceptDomainFormal MathTagsAgent EngineeringAI For MathematicsLLM ArchitectureReading5 minSourceAI-synthesised

DeepMind 的 *basic* Ralph-loop agent 在 LLM 改進時,匹敵了其 bespoke evolutionary+AlphaProof 系統;the bitter lesson / harness-shrinkage 在形式數學中得到證實

Agentic Loops 超越 Bespoke Systems 的插圖

資料來源#

摘要#

DeepMind 的 AI-Driven Formal Proof Search 論文最受矚目的實證發現,也是其對 The Bitter Lesson 最清晰的跨領域確認:一個 basic agent——獨立的 prover subagents 各自運行一個簡單的「Ralph loop」 進行 generate-edit-compile 週期——解決了全部 9 個 Erdős problems,這與功能繁複完備的 agent(evolutionary search + bespoke AlphaProof RL prover)所解決的數量相同,只是在最難的問題上花費了更高成本。論文的結論指出:「隨著 LLMs 變得更加強大,目前正在經歷一場從特化訓練系統向簡單 agentic loops 的持續轉變。」

The surprise, in the authors' words#

團隊基於其強大的競賽基準表現,選擇了功能完備的 agent (D) 進行大規模探索——在規劃階段,「較簡單的 agentic loops 並未表現出強大性能」。隨後,針對 9 個已解決的 Erdős problems 進行的 post-hoc analysis 發現:

「值得注意的是,basic agent 解決了所有 9 個問題,儘管在較難的問題上花費了更高的成本。」

他們將此歸因於兩點:(1) LLM 領域在規劃與分析之間「發生了實質性變化」(模型能力躍升),以及 (2) 「編譯器回饋在鞏固 LLM 推理方面的力量」——簡單 loop 之所以有效,是因為 Lean 的驗證器讓每一步都切實無誤。

Why this is the bitter lesson in a new field#

The Bitter Lesson:隨著時間推移,規模化的通用方法會擊敗人工設計的結構。這裡的「人工設計的結構」是指 bespoke apparatus——AlphaProof RL theorem-prover(一個 特化訓練系統)以及演化種群/Elo 機制(Evolutionary Proof Search)。而「規模化的通用方法」則是處於 plain loop 中並配有驗證器的前沿 LLM。隨著 LLM 的提升,在大多數問題上,bespoke scaffolding 的優勢 坍縮為成本差異,而非能力差異。這正是 Harness Shrinkage as Models Improve 所描述的動態——為彌補模型弱點而設計的 scaffolding 在模型變強時會變成累贅——只是這是在形式數學中觀察到的,而非在 coding harnesses 中。

The residual advantage (and its expiry date)#

bespoke agent 並非毫無用處——它「目前在最難的問題上仍保有優勢」,在兩個最棘手的 Erdős problems(#125, #138)上可節省 2×–5× 的成本。但作者明確指出該優勢的時效:「隨著 LLM 能力的增長,這種優勢可能會減少。」 這種模式是一個 退縮 的前沿,在此處 bespoke systems 才發揮作用:問題越難且模型越弱,特化結構的回報就越高——而這個區域在每次模型發布時都在縮小。(獨立的 AlphaProof tree-search 以及較小模型的 basic agents 什麼也沒解決——因此 loop 仍然需要足夠強大的模型 + 驗證器;參見 Scale-Dependent Prompt Sensitivity。)

Generalization#

可推廣的結論:當一個領域擁有廉價且可靠的驗證器時,應偏好能夠利用它的最簡單 agentic loop,並在每次模型發布時重新評估 bespoke scaffolding——它往往會從「賦予能力(capability-enabling)」轉變為「僅節省成本」,最後變成累贅(drag)。驗證器(The Verifiability Thesis)是讓簡單 loop 可行的關鍵;而 loop 是讓系統建置和維護成本低廉的原因。bespoke systems 只有在退縮的艱難前沿才能發揮其價值。

相關連結#

待解決的問題#

  • bespoke 優勢被定為「目前而言」。下一代模型的結論會是什麼——演化/AlphaProof 機制是否能在 任何 問題上倖存,還是完全坍縮為一條成本線?
  • 「簡單 loop + 驗證器擊敗 bespoke system」的結果,是僅在驗證器完美(Lean)的領域成立,還是在存在雜訊驗證器的領域(測試、LLM 裁判委員會)也成立?

資料來源#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 15
  • Agent Loop Pattern

    `/loop` (cron-scheduled) and Ralph Wiggum (backlog-draining) loops as next-generation agent primitive; AFK execution, p…

  • AI Accelerating AI Development

    The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…

  • AI-Driven Formal Proof Search

    LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…

  • AlphaProof Nexus

    DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…

  • Client-Side Agent Optimization

    AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…

  • Evolutionary Proof Search

    The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…

  • Google DeepMind

    Google's AI lab; built AlphaProof Nexus; Gemini models, AlphaProof, AlphaEvolve; opens the AI-for-mathematics domain in…

  • Harness Shrinkage as Models Improve

    Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…

  • Lean

    Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…

  • Formal Mathematics & Proof Search

    Map of Content for the formal-math domain — 3 concepts. Curated entry point; see Home for all domains.

  • Open Questions Backlog

    _96 pages with open questions, as of 2026-06-14._

  • Recursive Self-Improvement

    An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…

  • Scale-Dependent Prompt Sensitivity

    Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…

  • The Bitter Lesson

    Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…

  • When Does Verification Quality Determine Whether AI Automation Works?

    Verification-quality ladder from Lean/formal proof search through software CI and vulnerability reproduction; autonomy…

Related articles
  • AI-Driven Formal Proof Search

    LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…

  • Agent Harness Engineering

    Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…

  • Evolutionary Proof Search

    The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…

  • AlphaProof Nexus

    DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…

  • The Verifiability Thesis

    LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…