Summary#
This page covers the empirical half of the Anthropic Institute's When AI builds itself: the previously unreported internal data showing that AI is already accelerating the development of AI at Anthropic. Where public benchmarks (Task Time-Horizon Scaling) show capability rising, this page collects the deployment-side evidence that the rising capability is already feeding back into Anthropic's own engineering and research throughput. It is the present-tense ground on which the Recursive Self-Improvement extrapolation stands, and a concrete instance of the acceleration that the AI R&D autonomy eval gates against.
The engineering / research split (and the autonomy ladder)#
Building a frontier model takes two kinds of work, and Claude has progressed differently on each:
- Engineering (writing code, standing up infra, overseeing training): Claude "can be handed an underspecified problem and figure out how to solve it; humans supply the goal, but they no longer need to supply the method."
- Research (choosing experiments, interpreting results, deciding what to try next): Claude "can already match or outperform skilled humans at executing a well-specified experiment."
Across both, the persistent gap is judgment in choosing goals — see Research Taste as the Human Bottleneck. The essay maps capability onto the seniority ladder Anthropic uses for its own people:
1. Execute a specified task: "The export button isn't working, please fix it."
2. Set the approach for a given goal: "Investigate why the network slows down under heavy load."
3. Choose which problems are worth working on: "What should the team build next quarter?"
Claude has climbed from rung 1 into rung 2; rung 3 is the frontier.
Engineering evidence#
Claude writes most of Anthropic's code. As of May 2026, >80% of merged code is Claude-authored, up from low single digits before Claude Code's Feb 2025 research preview. (Leadership has publicly estimated 90%+ including scripts/experimental code; the >80% figure is the more conservative lines-merged-to-production attribution.)
~8× output per engineer. Lines merged per engineer per day held flat 2021–2024, then climbed with two inflection points: 2025 (Claude began to run code, not just suggest it) and 2026 (models began working autonomously over longer horizons). In Q2 2026 the typical engineer merged ~8× as much code/day as in 2024. Caveat stated plainly: lines-of-code measures quantity over quality, so 8× "is almost certainly an overstatement of the true productivity gain" — but it indicates real acceleration, and Anthropic does not reward LOC.
- A March 2026 poll (130 research-team employees) put median self-estimated output at ~4× with Mythos Preview vs no AI (Anthropic believes true uplift was somewhat lower — developer self-estimates are known to overestimate).
- Work that wouldn't have happened otherwise: in April 2026 Claude shipped 800+ fixes that cut a class of API errors 1000× — estimated at four human-years of painstaking cross-context bug-solving.
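The ~8× figure is a ratio of per-engineer merge rates between two periods. A minimal sketch of that arithmetic follows, under two loudly stated assumptions: that "typical engineer" means the median (the source does not say which average it uses), and that the line counts shown in the usage note are made up for illustration. The function names are hypothetical, not Anthropic's tooling.

```python
from statistics import median
from typing import Dict


def lines_per_engineer_day(merged_lines: Dict[str, int], working_days: int) -> Dict[str, float]:
    """Convert each engineer's total merged lines in a period into a daily rate."""
    return {eng: total / working_days for eng, total in merged_lines.items()}


def typical_multiple(
    baseline: Dict[str, int],
    current: Dict[str, int],
    baseline_days: int,
    current_days: int,
) -> float:
    """Ratio of the median daily merge rate now to the median rate in the baseline period.

    Assumption: the median stands in for the "typical engineer".
    """
    base_rate = median(lines_per_engineer_day(baseline, baseline_days).values())
    current_rate = median(lines_per_engineer_day(current, current_days).values())
    return current_rate / base_rate
```

Calling `typical_multiple({"a": 6000, "b": 3000, "c": 4500}, {"a": 42000, "b": 36000, "c": 30000}, 60, 60)` on these invented totals compares a median of 75 lines/day with a median of 600 lines/day. The metric counts lines, not value delivered, so it inherits the essay's own caveat that the multiple overstates the true productivity gain.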
Code quality reached parity. "Good code" means that it works and that another engineer can build on it. On whether it works: the rate at which staff correct, redirect, or take over mid-task has fallen steadily for a year, even on open-ended problems (session success on the hardest tier reached 76% in May 2026, up 50 percentage points in six months). On legibility: Claude-written code was "somewhat worse than human-written … in late 2025, is roughly at parity today, and we expect it to be strictly better within the year."
The automated reviewer. Every change is now read by an automated Claude reviewer before merge. A retrospective found it would have caught ~1/3 of the bugs behind past claude.ai incidents before production — "Claude is now catching the mistakes that [the best engineers in the world] missed." This is verification and the reviewer-in-fresh-context pattern operationalized at org scale.
Research evidence#
Three measurements, climbing the ladder from execution toward judgment:
- Kernel/experiment optimization (rung 1–2, now superhuman). A fixed eval run every release: given code that trains a small model, make it run as fast as possible while passing the same correctness checks. Opus 4 ~3× speedup (May 2025) → Mythos Preview ~52× (Apr 2026). A skilled human needs 4–8 hours to reach ~4×. "In under a year, Claude has gone from super helpful to superhuman" at optimizing within a defined experiment. (Caveat: the absolute multiple depends on headroom in the starting code and is not a real-world training speedup; the informative part is the like-for-like comparison across models and against humans.)
- Open-ended research, end to end (rung 2). The April 2026 automated weak-to-strong researcher: agents given an open AI-safety problem (can a weak model reliably supervise a stronger one?) proposed hypotheses, tested them, shared findings across parallel agents, and iterated. They recovered ~97% of the floor-to-ceiling gap over 800 cumulative agent-hours ($18k compute); two human researchers recovered ~23% over a week. Caveats: the result didn't transfer cleanly to production-scale models, and humans still chose the problem and wrote the scoring rubric; direction-setting was the only meaningful human role.
- Research next-step judgment (rung 2–3, the frontier signal). On n=129 real research sessions where the human took a detour, models saw only the pre-detour work and proposed a next step; a separate Claude (which saw how the session ended) judged who chose better. Opus 4.5 (Nov 2025) beat the human 51% → Mythos Preview (Apr 2026) 64%. Crucial caveat on judge bias: on a separate set of 127 moments where the human's move was already strong, models were judged better only ~20% of the time, so this is not a like-for-like human-vs-model comparison, but a trend on hard, ambiguous decisions.
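Each of the three measurements reduces to a simple statistic, and a short Python sketch can make the definitions concrete. Everything here is an illustrative assumption rather than Anthropic's harness: the function names, the example timings, the floor and ceiling scores, and the win counts (83 and 66 of 129, back-calculated from the rounded 64% and 51%) are all hypothetical. The Wilson score interval is a standard way to express sampling uncertainty for a proportion; it says nothing about the judge bias the caveat describes.

```python
import math
from typing import Tuple


def speedup(baseline_seconds: float, optimized_seconds: float, passes_checks: bool) -> float:
    """Speedup multiple for the kernel eval.

    Assumption: a run that fails the correctness checks earns no credit (0.0).
    """
    if not passes_checks:
        return 0.0
    return baseline_seconds / optimized_seconds


def gap_recovered(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap closed by a weak-to-strong method."""
    return (score - floor) / (ceiling - floor)


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval (about 95% when z = 1.96) for a win rate of wins / n."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical inputs chosen only to reproduce the rounded figures in the text.
    print(speedup(520.0, 10.0, passes_checks=True))
    print(gap_recovered(score=0.794, floor=0.60, ceiling=0.80))
    print(wilson_interval(66, 129))  # earlier model, assumed win count
    print(wilson_interval(83, 129))  # later model, assumed win count
```

With only 129 sessions the intervals are wide: under these assumed counts the earlier model's interval straddles one half, while the later model's lower bound sits above it. That is consistent with reading the result as a trend rather than a precise estimate.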
The honest caveats#
The essay is unusually careful to bound its own evidence: LOC overstates productivity; self-reported uplift is biased upward; the kernel multiple is headroom-dependent; the weak-to-strong (W2S) result didn't transfer to scale and used a human-chosen problem; the next-step test was run on deliberately chosen weak-human moments. The load-bearing claim survives all of them: the human role is narrowing at each step, and the doing now costs almost nothing in human time (though it still costs compute).
Connections#
- Recursive Self-Improvement — the extrapolation this evidence grounds; "the loop is already tightening"
- AI R&D Autonomy Evaluation (AECI) — the formal capability gate (AECI, substitution threshold); this page is the deployment-side correlate the eval anticipated
- Research Taste as the Human Bottleneck — the persistent gap these measurements keep hitting: choosing goals, not executing them
- Task Time-Horizon Scaling — the external-benchmark companion (METR, SWE-bench, CORE-Bench) to this internal data
- Verification as the New Bottleneck — the automated reviewer and "review became the new bottleneck" are this thesis at org scale
- Harness Shrinkage as Models Improve — the same narrowing role: humans stop writing code, shift to direction and review
- The Bitter Lesson — "research progress is mostly a function of tools and resources" is the bitter lesson applied to R&D itself
- Agentic Loops Overtake Bespoke Systems — the same simple-loop-overtakes-bespoke dynamic, measured in formal math
- LLM-Driven Vulnerability Research — Project Glasswing as a worked example of AI-accelerated technical output
- AI-Native Startup Lifecycle — the diffusion of this acceleration into the wider economy: 100-person firms doing 1,000-person work
- Frontier Pause Verification — compounding acceleration is why "we don't have decades" to build a verification regime
Open questions#
- LOC, self-reports, and headroom-dependent multiples all overstate; what unbiased throughput metric would Anthropic's promised shift to "direct measurement of AI R&D acceleration and researcher uplift" (AI R&D Autonomy Evaluation (AECI)) actually use?
- The W2S result didn't transfer to production-scale models. Is that a temporary scaling artifact or a structural limit on autonomous research?
- The next-step judgment trend (51%→64%) is measured only on weak-human-move slices. What does the curve look like on a representative sample of research decisions?
Sources#
- When AI builds itself — §"Evidence from within Anthropic" (engineering + research evidence, productivity poll, kernel eval, W2S researcher, next-step judgment)
