
AI Accelerating AI Development

Published: June 7, 2026 · Filed: Concept · Domain: Superintelligence Trajectory · Tags: Governance, AI R&D, Productivity, Capability Evaluation, Anthropic · Reading: 8 min · Source: AI-synthesised

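The empirical core of *When AI builds itself*: measured evidence that AI is already accelerating AI R&D at Anthropic. More than 80% of merged code is written by Claude; code per engineer per day is up roughly 8× from 2024; a kernel-optimization eval rose from 3× to 52× within a year; an automated researcher recovered 97% of the weak-to-strong gap; and the model's judgment on the next research step beats the human's 64% of the time.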

Sources

When AI builds itself

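Summary

The empirical section of the Anthropic Institute's When AI builds itself: previously unpublished internal data showing that AI is already accelerating AI development at Anthropic. While public benchmarks (Task Time-Horizon Scaling) show capability rising, this page collects the deployment-side evidence that the rising capability is already feeding back into Anthropic's own engineering and research output. It is the present-day empirical footing for the Recursive Self-Improvement extrapolation, and a concrete case of the acceleration that the AI R&D autonomy eval tests for.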

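The engineering/research split (and the autonomy ladder)

Building a frontier model takes two kinds of work, and Claude's progress differs between them:

Engineering (writing code, building infrastructure, overseeing training): Claude "can take an underspecified problem and find a solution; humans supply the goal but no longer need to supply the method."
Research (choosing experiments, interpreting results, deciding what to try next): Claude "can already match or exceed skilled humans at running well-specified experiments."

The gap that persists in both is judgment in choosing goals; see Research Taste as the Human Bottleneck. The essay maps capability onto the seniority ladder Anthropic uses for its own staff:

1. Execute assigned tasks: "The export button is broken; please fix it."
2. Set the approach for a given goal: "Investigate why the network slows down under heavy load."
3. Choose which problems are worth working on: "What should the team build next quarter?"

Claude has climbed from rung 1 into rung 2; rung 3 remains the frontier.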

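Engineering evidence

Claude writes most of Anthropic's code. As of May 2026, more than 80% of merged code was written by Claude; before Claude Code launched as a research preview in February 2025, the share was in the single digits. (Leadership has publicly estimated the share at over 90% when scripts and experimental code are included; the 80%+ figure is the more conservative attribution, counting lines of code merged to production.)

Output per engineer is up roughly 8×. Lines of code merged per engineer per day were flat from 2021 to 2024, followed by two inflection points: 2025 (Claude began executing code rather than only suggesting it) and 2026 (models began working autonomously over longer time horizons). In Q2 2026, the typical engineer merged roughly 8× as much code per day as in 2024. The limit has to be stated plainly: lines of code measure quantity, not quality, so 8× "almost certainly overstates the true productivity gain." It does show acceleration, though, and Anthropic does not reward engineers on LOC.

A March 2026 survey (130 research-team employees) put the median self-estimated output with Mythos Preview at roughly 4× relative to working without AI (Anthropic believes the true uplift is somewhat lower, since developer self-estimates are known to run high).
Work that would not otherwise have happened: in April 2026, Claude shipped more than 800 fixes that cut one class of API errors by 1000×, an estimated four person-years of painstaking cross-context debugging.

Code quality has reached parity. "Good code" is defined as code that works and that another engineer can build on. On "works": the rate at which employees correct, redirect, or take over mid-task has fallen steadily for a year, even on open-ended problems (the success rate for the hardest tier of sessions reached 76% in May 2026, up 50 percentage points in six months). On readability: Claude-written code was "slightly worse than human-written code in late 2025 ... is now roughly on par, and we expect it to be clearly better within a year."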
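
The hardest-tier figure is stated in percentage points, which is easy to misread as a relative change. The following is a minimal arithmetic sketch, assuming the 76% and 50-point figures can be treated as exact (the source reports them as rounded values); the function name is illustrative and not from the source.

```python
def implied_baseline(current_pct: float, gain_points: float) -> float:
    """Return the earlier rate implied by a gain stated in percentage points."""
    return current_pct - gain_points


current = 76.0  # hardest-tier session success rate, May 2026 (from the text)
gain = 50.0     # gain over six months, in percentage points (from the text)

baseline = implied_baseline(current, gain)
relative = current / baseline  # ratio of the two rates, not a point difference

print(f"implied rate six months earlier: {baseline:.0f}%")
print(f"relative improvement: {relative:.1f}x")
```

Under these assumptions the earlier rate would be about 26%, so a 50-point gain corresponds to nearly a threefold rise in the success rate, not a 50% rise.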

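Automated reviewer. Every change is now read by an automated Claude reviewer before it is merged. A retrospective analysis found that it would have caught about a third of the bugs behind past claude.ai incidents before they reached production: "Claude now catches bugs that [the world's best engineers] missed." This is verification, and the pattern of reviewing in a fresh context, applied at organizational scale.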

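Research evidence

Three measurements, climbing the ladder from execution toward judgment:

Kernel/experiment optimization (rungs 1–2, now superhuman). A fixed eval is run at every release: given code that trains a small model, make it run as fast as possible while passing the same correctness checks. Opus 4 achieved roughly a 3× speedup (May 2025) → Mythos Preview roughly 52× (April 2026). A skilled human needs 4–8 hours to reach about 4×. "In less than a year, Claude went from very helpful to superhuman," at least for optimization inside a well-defined experiment. (Limit: the absolute multiple depends on how much headroom the starting code has and does not represent a real-world training speedup; what is informative is the like-for-like comparison across models and between models and humans.)
Open-ended research, end to end (rung 2). The April 2026 automated weak-to-strong researcher: agents took on an open AI safety question (can a weak model reliably supervise a stronger one?), proposed hypotheses, tested them, shared findings across parallel agents, and iterated. In about 800 cumulative agent-hours (about $18,000 of compute), they recovered about 97% of the gap between the lowest and highest performance; two human researchers recovered about 23% in a week. The limits: the result did not transfer cleanly to production-scale models, and humans still chose the problem and wrote the grading rubric. Setting direction was the only meaningful human role.
Judgment on the next research step (rungs 2–3, frontier signal). In n=129 real research sessions where the human had taken a detour, the model saw only the work before the detour and proposed a next step; another Claude (which saw how the session ended) judged who had chosen better. Opus 4.5 (November 2025) beat the human 51% of the time → Mythos Preview (April 2026) reached 64%. One key caveat concerns judge bias: in a separate set of 127 moments where the human's move was already strong, the model was judged better only about 20% of the time. This is therefore not a like-for-like comparison of humans and models but a trend on hard, ambiguous decisions.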
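
The "share of the gap recovered" figure in the weak-to-strong item can be read as a normalized score. The sketch below assumes a common definition (achieved score minus the weak floor, divided by the strong ceiling minus the weak floor); the source does not spell out its formula. The scores passed in are hypothetical placeholders chosen to land near 97%, not reported results. Only the 800 agent-hours and the $18,000 compute cost come from the text.

```python
def gap_recovered(weak_floor: float, strong_ceiling: float, achieved: float) -> float:
    """Fraction of the floor-to-ceiling gap closed by the achieved score."""
    if strong_ceiling <= weak_floor:
        raise ValueError("strong_ceiling must exceed weak_floor")
    return (achieved - weak_floor) / (strong_ceiling - weak_floor)


def cost_per_agent_hour(total_cost_usd: float, agent_hours: float) -> float:
    """Average compute cost per cumulative agent-hour."""
    if agent_hours <= 0:
        raise ValueError("agent_hours must be positive")
    return total_cost_usd / agent_hours


# Hypothetical scores, for illustration only.
floor, ceiling, achieved = 0.60, 0.80, 0.794
print(f"gap recovered: {gap_recovered(floor, ceiling, achieved):.0%}")

# Figures taken from the text: about $18,000 over about 800 agent-hours.
print(f"cost per agent-hour: ${cost_per_agent_hour(18_000, 800):.2f}")
```

Because the metric is normalized, it says nothing about how large the floor-to-ceiling gap was in absolute terms, which is one reason the transfer caveat matters.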

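Honest limits

The essay scopes its own evidence carefully: LOC overstates productivity; self-reported uplift is biased upward; the kernel multiple depends on headroom; the W2S result did not transfer to large-scale models and used a human-chosen problem; the next-step test was run on deliberately selected moments of human weakness. The load-bearing claim survives all of these limits: the human role shrinks at every step, and "execution" now costs almost no human time (though it still consumes compute).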
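
The headroom caveat on the kernel multiple can be made concrete. The sketch below assumes a speedup is scored as baseline runtime divided by optimized runtime, and only counts when the optimized output passes a correctness check. The function names, the tolerance, and all runtimes are illustrative assumptions, not the eval's actual harness or data.

```python
import math
from typing import Sequence


def passes_check(reference: Sequence[float], candidate: Sequence[float],
                 rel_tol: float = 1e-6) -> bool:
    """True if the candidate output matches the reference within tolerance."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol) for r, c in zip(reference, candidate)
    )


def speedup(baseline_seconds: float, optimized_seconds: float, correct: bool) -> float:
    """Baseline/optimized runtime ratio; an incorrect result scores 0."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / optimized_seconds if correct else 0.0


# Hypothetical outputs and runtimes, for illustration only.
reference_out = [1.0, 2.0, 3.0]
optimized_out = [1.0, 2.0, 3.0 + 1e-9]
ok = passes_check(reference_out, optimized_out)

optimized_runtime = 2.0
for label, baseline_runtime in [("slack baseline", 100.0), ("tuned baseline", 10.0)]:
    print(f"{label}: {speedup(baseline_runtime, optimized_runtime, ok):.0f}x")
```

The same optimized code scores very differently depending on how slow the starting point was, which is why the comparison across models and against humans on one fixed baseline carries more information than the absolute multiple.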

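Open questions

LOC, self-reports, and headroom-dependent multiples all overstate. Once Anthropic follows through on its commitment to move toward "directly measuring AI R&D acceleration and researcher uplift," what unbiased output metric will it actually use (AI R&D Autonomy Evaluation (AECI))? A partial answer exists: Researcher Uplift from Code Output. Kwa argues that code output (the 8× itself) is a better measure than per-hour coding uplift, because output already prices in marginal value through time reallocation and does not depend on production-function assumptions. It is still contaminated by redundant, nearly useless "Cadillac" code and by fun-driven shifts in time allocation, so the metric it really points to is quality-adjusted code output, which still requires internal data that LoC cannot provide.
The W2S result did not transfer to production-scale models. Is that a temporary scaling effect or a structural limit of autonomous research?
The next-step judgment trend (51%→64%) was measured only on a slice of weak human moves. What would the curve look like on a representative sample of research decisions?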
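
One way to reason about the last question is to treat the two measured slices as a mixture. The sketch below is illustrative only. It assumes the overall win rate is a weighted average of the weak-moment rate (about 64%, n=129) and the strong-moment rate (about 20%, n=127), with the share of weak moments in ordinary research left as an unknown input. It also computes a Wilson score interval to show how wide the uncertainty is at this sample size; the win count is back-computed from a rounded percentage and is therefore an assumption.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def blended_win_rate(weak_share: float, weak_rate: float = 0.64,
                     strong_rate: float = 0.20) -> float:
    """Win rate on a mix of weak-human and strong-human moments."""
    if not 0 <= weak_share <= 1:
        raise ValueError("weak_share must be in [0, 1]")
    return weak_share * weak_rate + (1 - weak_share) * strong_rate


# Assumed count: 64% of 129 rounds to 83 wins.
low, high = wilson_interval(83, 129)
print(f"weak-moment slice interval: {low:.2f} to {high:.2f}")

# The share of weak moments is unknown; these values are hypothetical.
for share in (0.25, 0.50, 0.75):
    print(f"weak share {share:.0%}: blended win rate {blended_win_rate(share):.0%}")
```

Under this simple model the representative-sample figure is driven mostly by how often human moves are weak, a quantity the source does not report.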

Sources

When AI builds itself, § "Evidence from within Anthropic" (engineering and research evidence, productivity survey, kernel eval, W2S researcher, next-step judgment)


About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 20

Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
AI-Native Startup Lifecycle
Anthropic's May 2026 reframing of Idea/MVP/Launch/Scale assuming AI infrastructure: each stage's headcount/capital/skil…
AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Anthropic Institute
Anthropic's policy/governance research arm; published *When AI builds itself* (Favaro & Clark, 2026) on recursive self-…
Claude Opus 4.8
Anthropic's most capable general-access model (May 2026); upgrade on Opus 4.7 in SWE/agentic/knowledge work; does not a…
Frontier Pause Verification
The arms-control problem of a credible, verifiable slowdown or pause of frontier AI: detectability is harder than for o…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
LLM-Driven Vulnerability Research
Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and An…
METR
Independent AI-evaluation org behind the 'time horizons' benchmark — the task length a model can complete reliably on i…
Superintelligence Trajectory
Map of Content for the superintelligence-trajectory domain — 20 concepts. The path from AGI to ASI: recursive self-impr…
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_396 actionable open questions across 155 pages · 79 predictions · 9 notes · 21 in progress · 59 watching (entities), a…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Research Taste as the Human Bottleneck
The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…
Researcher Uplift from Code Output
Thomas Kwa (METR) translates Anthropic's reported 8× code-per-engineer-per-day into serial researcher uplift with produ…
RSI Growth Curves: Which Friction Binds First?
DeepMind's exponential/hyperbolic/S-curve growth shapes are Anthropic's compounding-efficiency/full-RSI/stalled futures…
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
