Summary#
The external-benchmark trendline behind Recursive Self-Improvement: the length of task an AI can complete reliably on its own is doubling roughly every four months, up from an earlier pace of roughly seven months per doubling. The metric, from METR's time-horizons work, reports the task length at which a model succeeds 50% of the time across a basket of tasks (the trend looks the same at an 80% threshold). It is the quantitative spine of When AI builds itself: where AI Accelerating AI Development shows AI speeding up AI work inside Anthropic, this page shows the underlying capability rising on public benchmarks.
The doubling curve#
| Model | Approx. date | Task length at 50% reliability |
|---|---|---|
| Claude Opus 3 | Mar 2024 | ~4 minutes |
| Claude Sonnet 3.7 | ~Mar 2025 | ~1.5 hours |
| Claude Opus 4.6 | ~2026 | ~12 hours |
| (projected) | 2026 ("this year" in the essay) | days |
| (projected) | 2027 | weeks |
Mythos Preview is at the edge of measurability: METR found it could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks." The trend's acceleration (7-month → 4-month doubling) is the part that matters — it is why the essay argues the loop may close "sooner than most institutions are prepared for."
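The arithmetic behind the table can be sketched in a few lines. This is a rough illustration, not METR's method: the function names are hypothetical, the only inputs are the rounded table values and the ~4-month doubling period quoted above, and "a week" is taken to mean 168 hours of continuous task time (an assumption). Two rounded table rows give only a noisy estimate of the doubling time, so the first number need not match the trendline figure.

```python
import math

MINUTES_PER_HOUR = 60


def doublings(start_minutes: float, end_minutes: float) -> float:
    """Number of doublings needed to grow from start to end."""
    return math.log2(end_minutes / start_minutes)


def implied_doubling_months(start_minutes: float, end_minutes: float,
                            months_elapsed: float) -> float:
    """Doubling time implied by two points on the curve."""
    return months_elapsed / doublings(start_minutes, end_minutes)


def months_until(current_minutes: float, target_minutes: float,
                 doubling_months: float) -> float:
    """Months to reach a target task length at a fixed doubling time."""
    return doublings(current_minutes, target_minutes) * doubling_months


# Table rows: ~4 minutes (Mar 2024) to ~1.5 hours (~Mar 2025), about 12 months apart.
print(round(implied_doubling_months(4, 90, 12), 1))

# Projection from ~12 hours at the ~4-month doubling quoted on this page.
twelve_hours = 12 * MINUTES_PER_HOUR
two_days = 48 * MINUTES_PER_HOUR        # "days" row
one_week = 7 * 24 * MINUTES_PER_HOUR    # "weeks" row; continuous hours, an assumption
print(round(months_until(twelve_hours, two_days, 4), 1))
print(round(months_until(twelve_hours, one_week, 4), 1))
```

Under these assumptions the step from ~12 hours to multi-day tasks takes two doublings, and reaching a full week takes a little under four, which is roughly in line with the "days" and "weeks" rows of the table.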
The June 2026 Mythos-class release pushes further still: Fable 5 / Mythos 5 "can work autonomously for longer than any previous Claude models," and the concrete datapoint is over a week of largely autonomous genomics work (assembling data, designing and training a model, beating a published baseline — see Autonomous Scientific Discovery). A week-long autonomous research run is well past METR's measurable task basket — the metric is now chasing the capability rather than bounding it.
Benchmark saturation as the corroborating signal#
The same pattern shows up as benchmarks going from near-zero scores to "saturated" (effectively 100%, given that errors in the benchmarks themselves cap many of them below a perfect score):
- SWE-bench: hands a model a real open-source codebase and a bug report, and asks for a change that passes the project's own tests. Low single digits → saturated in two years. (Cf. Claude Opus 4.8: 88.6% on SWE-bench Verified.)
- CORE-Bench: asks a model to reproduce a published paper's results from its code and data, a prerequisite for conducting original research. ~20% (2024) → saturated in fifteen months.
Saturation is why time-horizon length, not single-benchmark accuracy, has become the more informative capability axis — and why Anthropic retired its task-based AI-R&D benchmarks once models crossed the top human baselines (see AI R&D Autonomy Evaluation (AECI)).
Caveats#
- Infrastructure strain is a leading indicator, not just trivia. GitHub saw ~1B commits in all of 2025; by mid-2026 it saw ~275M/week (~14B/year pace) and is "pushing incredibly hard" on capacity — a downstream signature of the same throughput surge.
- Time-horizon numbers are a 50%-reliability statistic on a basket of tasks; the jaggedness within the basket is real — a model that handles a 12-hour task can still fail a trivial one.
- Whether the curve is a true exponential or an S-curve approaching its bend is the explicit uncertainty of Recursive Self-Improvement's first future.
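The 50%-reliability caveat can be made concrete with a small model. The sketch below assumes success probability falls off as a logistic function of log task length, which is a common way to define such a horizon but is not something the sources on this page spell out. The function names and the `SLOPE` value are hypothetical and chosen only for illustration; no measured data is used.

```python
import math


def success_probability(task_minutes: float, h50_minutes: float,
                        slope: float) -> float:
    """Logistic in log2(task length); equals 0.5 when task_minutes == h50_minutes."""
    x = math.log2(task_minutes) - math.log2(h50_minutes)
    return 1.0 / (1.0 + math.exp(slope * x))


def horizon_at(reliability: float, h50_minutes: float, slope: float) -> float:
    """Task length at which the modeled success probability equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return h50_minutes * 2.0 ** (-logit / slope)


H50 = 12 * 60   # a ~12-hour 50% horizon, taken from the table on this page
SLOPE = 0.6     # hypothetical steepness, not a measured value

# Even far below the horizon, modeled success is high but not certain.
print(success_probability(5, H50, SLOPE))

# The 80% horizon is shorter than the 50% horizon for any positive slope.
print(horizon_at(0.8, H50, SLOPE) / 60, "hours")
```

A smooth curve like this only captures the average fall-off. The jaggedness caveat is stronger: individual short tasks can fail for reasons the fitted curve does not see.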
Connections#
- Recursive Self-Improvement — this curve, extrapolated, is the quantitative case that the loop could close soon
- AI Accelerating AI Development — the internal-throughput companion to this external-benchmark evidence
- Jagged Intelligence (Ghosts, Not Animals) — the within-basket caveat: long-horizon competence coexists with trivial failures
- The Bitter Lesson — rising capability on general benchmarks is what makes hand-built scaffolding a shrinking advantage
- AI R&D Autonomy Evaluation (AECI) — why saturated task-based benchmarks were retired from RSP determinations
- Build for the Next Model — the forecastable capability curve this measures is what makes "bet on the next release" a rational product strategy rather than a gamble
- Autonomous Scientific Discovery — Mythos 5's week-long autonomous genomics run is a concrete long-horizon datapoint past Mythos Preview's measured 16h ceiling
Open questions#
- Is the 4-month doubling a stable regime or a local steepening? The trend's shape (exponential vs S-curve) is undetermined.
- Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks have to be measured, and who builds those tasks?
Sources#
- When AI builds itself — §"Evidence from the outside world" (METR time horizons; SWE-bench / CORE-Bench saturation; GitHub commit-volume footnote)
- Claude Fable 5 and Claude Mythos 5 — "work autonomously for longer than any previous Claude models"; week-long autonomous genomics
