Howardism

Task Time-Horizon Scaling

Published: June 7, 2026 · Filed: Concept · Domain: LLM Architecture · Tags: LLM Architecture, Capability Evaluation, Benchmarks, Capability Trajectory · Reading: 5 min · Source: AI-synthesised

METR's measure of the longest task an AI can complete reliably on its own, now doubling roughly every 4 months (previously every 7): Opus 3 at ~4 min (Mar 2024) → Opus 4.6 at ~12 hr (2026) → weeks projected for 2027. Paired with benchmark saturation (SWE-bench, CORE-Bench).


Summary#

The external-benchmark trendline behind Recursive Self-Improvement: the length of task an AI can complete reliably on its own is doubling roughly every four months, faster than the earlier doubling of roughly every seven months. The metric, from METR's time-horizons work, reports the task duration at which a model succeeds 50% of the time on a basket of tasks (the curve looks the same at an 80% threshold). It is the quantitative spine of When AI builds itself: where AI Accelerating AI Development shows AI speeding up AI work inside Anthropic, this shows the underlying capability rising on public benchmarks.
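A minimal sketch of how a reliability threshold maps to a time horizon. It assumes success probability follows a logistic curve in the log of task length; that functional form, the function name `horizon_minutes`, and the parameter values are illustrative assumptions, not METR's published code or fitted numbers.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def horizon_minutes(intercept: float, slope: float, reliability: float) -> float:
    """Task length (minutes) at which predicted success equals `reliability`.

    Assumed model: P(success) = sigmoid(intercept + slope * log2(minutes)),
    with slope < 0 so that longer tasks are harder.
    """
    if slope >= 0:
        raise ValueError("slope must be negative (longer tasks are harder)")
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    return 2.0 ** ((logit(reliability) - intercept) / slope)


# Hypothetical fitted parameters for one model (not real data).
intercept, slope = 4.0, -0.6

h50 = horizon_minutes(intercept, slope, 0.5)
h80 = horizon_minutes(intercept, slope, 0.8)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

Under this assumed model the ratio between the 50% and 80% horizons depends only on the slope, not on the intercept. If the slope stayed similar from model to model, the 80% curve would be the 50% curve shifted down by a constant on a log axis, which is one way to read "the curve looks the same at 80%".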

The doubling curve#

| Model | Date (approx.) | Reliable task length |
|---|---|---|
| Claude Opus 3 | Mar 2024 | ~4 minutes |
| Claude Sonnet 3.7 | ~Mar 2025 | ~1.5 hours |
| Claude Opus 4.6 | ~2026 | ~12 hours |
| (projected) | this year | days |
| (projected) | 2027 | weeks |
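As a rough check on the rows above, the doubling time implied by any two points can be computed directly. This sketch takes the approximate dates and durations at face value; the helper name `doubling_time_months` is hypothetical, and a two-point estimate from rounded figures is much noisier than a trend fitted across many models.

```python
import math


def doubling_time_months(start_minutes: float, end_minutes: float,
                         months_elapsed: float) -> float:
    """Months per doubling implied by two (task length, date) observations."""
    if start_minutes <= 0 or end_minutes <= start_minutes:
        raise ValueError("end_minutes must exceed a positive start_minutes")
    doublings = math.log2(end_minutes / start_minutes)
    return months_elapsed / doublings


# Two rows from the table; the 12-month gap assumes the "~" dates are exact.
opus_3_minutes = 4.0            # ~4 minutes, Mar 2024
sonnet_37_minutes = 1.5 * 60    # ~1.5 hours, ~Mar 2025

implied = doubling_time_months(opus_3_minutes, sonnet_37_minutes, 12)
print(f"Implied doubling time: {implied:.1f} months")
```

The Opus 4.6 row is left out because the table gives only a year for it; supplying a month would be a guess. With a known date, the same function applies unchanged.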

Mythos Preview is at the edge of measurability: METR found it could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks." The acceleration of the trend (from a 7-month to a 4-month doubling) is the part that matters: it is why the essay, When AI builds itself, argues the loop may close "sooner than most institutions are prepared for."
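To see why the change in doubling time matters more than any single datapoint, the sketch below extrapolates the same starting horizon under both rates. It assumes a clean exponential, a 12-hour starting point, and 24-hour days; `project_hours` is a hypothetical helper, and none of the outputs are forecasts from the source.

```python
def project_hours(start_hours: float, months_ahead: float,
                  doubling_months: float) -> float:
    """Extrapolate a time horizon assuming a constant doubling time."""
    return start_hours * 2 ** (months_ahead / doubling_months)


START_HOURS = 12.0  # the ~12-hour horizon cited for Claude Opus 4.6

for months_ahead in (12, 24):
    fast = project_hours(START_HOURS, months_ahead, doubling_months=4)
    slow = project_hours(START_HOURS, months_ahead, doubling_months=7)
    print(f"+{months_ahead} months: "
          f"4-month doubling -> {fast / 24:.1f} days, "
          f"7-month doubling -> {slow / 24:.1f} days")
```

At a 4-month doubling the horizon grows 8× in a year and 64× in two, which lines up with the "days" and "weeks" projections; at a 7-month doubling the two-year figure is still only a few days.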

The June 2026 Mythos-class release pushes further still: Fable 5 / Mythos 5 "can work autonomously for longer than any previous Claude models," and the concrete datapoint is over a week of largely autonomous genomics work (assembling data, designing and training a model, beating a published baseline — see Autonomous Scientific Discovery). A week-long autonomous research run is well past METR's measurable task basket — the metric is now chasing the capability rather than bounding it.

Benchmark saturation as the corroborating signal#

The same pattern appears as benchmarks going from near-zero to "saturated" (≈100%, allowing for errors that cap many benchmarks below 100%):

  • SWE-bench — hands a model a real open-source codebase + bug report and asks for a change that passes the project's own tests. Low single digits → saturated in two years. (Cf. Claude Opus 4.8: 88.6 on SWE-bench Verified.)
  • CORE-Bench — reproduce a published paper's results from its code and data; a prerequisite for conducting original research. ~20% (2024) → saturated in fifteen months.

Saturation is why time-horizon length, not single-benchmark accuracy, has become the more informative capability axis — and why Anthropic retired its task-based AI-R&D benchmarks once models crossed the top human baselines (see AI R&D Autonomy Evaluation (AECI)).

Caveats#

  • Infrastructure strain is a leading indicator, not just trivia. GitHub saw ~1B commits in all of 2025; by mid-2026 it saw ~275M/week (~14B/year pace) and is "pushing incredibly hard" on capacity — a downstream signature of the same throughput surge.
  • Time-horizon numbers are a 50%-reliability statistic on a basket of tasks; the jaggedness within the basket is real — a model that handles a 12-hour task can still fail a trivial one.
  • Whether the curve is a true exponential or an S-curve approaching its bend is the explicit uncertainty of Recursive Self-Improvement's first future.
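The annualised figure in the first caveat is simple arithmetic on the quoted weekly rate. A short check, assuming 52 weeks per year and treating both commit counts as the round numbers quoted above:

```python
WEEKS_PER_YEAR = 52

commits_2025 = 1_000_000_000          # ~1B commits in all of 2025
weekly_mid_2026 = 275_000_000         # ~275M commits per week by mid-2026

annualised = weekly_mid_2026 * WEEKS_PER_YEAR
multiple = annualised / commits_2025

print(f"Annualised pace: {annualised / 1e9:.1f}B commits/year")
print(f"Multiple of the 2025 total: {multiple:.1f}x")
```

This reproduces the ~14B/year pace, roughly fourteen times the 2025 total.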

Open questions#

  • Is the 4-month doubling a stable regime or a local steepening? The trend's shape (exponential vs S-curve) is undetermined.
  • Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks become measurable — and who builds those tasks?

Sources#

  • When AI builds itself — §"Evidence from the outside world" (METR time horizons; SWE-bench / CORE-Bench saturation; GitHub commit-volume footnote)
  • Claude Fable 5 and Claude Mythos 5 — "work autonomously for longer than any previous Claude models"; week-long autonomous genomics
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
