
Task Time-Horizon Scaling

Published: June 7, 2026
Filed: Concept
Domain: Evals & Benchmarks
Tags: LLM Architecture, Capability Evaluation, Benchmarks, Capability Trajectory
Reading: 14 min
Source: AI-synthesised

METR's measure of the task length AI can complete reliably on its own, now doubling roughly every 4 months (accelerated from roughly every 7): Opus 3 ~4min (Mar 2024) → Opus 4.6 ~12hr (2026) → weeks projected for 2027; paired with benchmark saturation (SWE-bench, CORE-Bench)



Summary#

The external-benchmark trendline behind Recursive Self-Improvement: the length of task an AI can complete reliably on its own is doubling roughly every four months, accelerated from an earlier doubling time of about seven months. The metric, from METR's time-horizons work, reports the task length, measured by how long the task takes a skilled human, at which a model succeeds 50% of the time on a basket of tasks (the trend looks the same at an 80% threshold). It is the quantitative spine of When AI builds itself: where AI Accelerating AI Development shows AI speeding up AI work inside Anthropic, this shows the underlying capability rising on public benchmarks.
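
To make the "50%-reliable" and "80%-reliable" readings concrete, here is a minimal sketch of how a horizon can be read off a fitted success curve. It assumes success probability follows a logistic curve in log2 of human task time; that modelling choice, the function name `horizon_minutes`, and the fit parameters are illustrative assumptions, not values taken from this page or from METR's data.

```python
import math


def horizon_minutes(intercept: float, slope: float, reliability: float = 0.5) -> float:
    """Task length (human minutes) where the fitted success rate equals `reliability`.

    Assumed model: P(success) = sigmoid(intercept + slope * log2(minutes)),
    with slope < 0 so that longer tasks are less likely to succeed.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if slope >= 0:
        raise ValueError("slope must be negative")
    # Invert the sigmoid: solve intercept + slope * log2(minutes) = logit(reliability).
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical fit, chosen only so the 50% horizon lands at 12 hours (720 minutes).
SLOPE = -0.6
INTERCEPT = -SLOPE * math.log2(720)

h50 = horizon_minutes(INTERCEPT, SLOPE, 0.5)
h80 = horizon_minutes(INTERCEPT, SLOPE, 0.8)
print(f"50% horizon: {h50 / 60:.1f} h, 80% horizon: {h80 / 60:.1f} h")
```

Under this assumed model the 80% horizon is always shorter than the 50% horizon for the same fit, which is why a horizon figure only means something alongside its reliability threshold.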

The doubling curve#

Model	~Date	Reliable task length
Claude Opus 3	Mar 2024	~4 minutes
Claude Sonnet 3.7	~Mar 2025	~1.5 hours
Claude Opus 4.6	~2026	~12 hours
(projected)	this year	days
(projected)	2027	weeks

Mythos Preview is at the edge of measurability: METR found it could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks." The trend's acceleration (7-month → 4-month doubling) is the part that matters — it is why the essay argues the loop may close "sooner than most institutions are prepared for."
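
The difference between a 4-month and a 7-month doubling time is easier to see as arithmetic. The sketch below extrapolates a pure exponential from the ~12-hour figure in the table; the 40-hour target (one working week of human time) and the function name `months_to_reach` are assumptions made for illustration, and the calculation ignores the exponential-vs-S-curve uncertainty entirely.

```python
import math


def months_to_reach(current_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months until the horizon grows from current_hours to target_hours,
    assuming it doubles every `doubling_months` (pure exponential growth)."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


CURRENT_HOURS = 12.0  # from the table above (Claude Opus 4.6, ~12 hours)
TARGET_HOURS = 40.0   # assumption: "a week" read as one 40-hour working week

for doubling in (4.0, 7.0):
    months = months_to_reach(CURRENT_HOURS, TARGET_HOURS, doubling)
    print(f"doubling every {doubling:.0f} months -> about {months:.1f} months to reach {TARGET_HOURS:.0f} h")
```

Because the wait scales linearly with the doubling time, moving from a 7-month to a 4-month regime cuts every extrapolated arrival date by the same factor of 4/7.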

The June 2026 Mythos-class release pushes further still: Fable 5 / Mythos 5 "can work autonomously for longer than any previous Claude models," and the concrete datapoint is over a week of largely autonomous genomics work (assembling data, designing and training a model, beating a published baseline — see Autonomous Scientific Discovery). A week-long autonomous research run is well past METR's measurable task basket — the metric is now chasing the capability rather than bounding it.

The horizon — and its doubling rate — is itself budget-dependent (UK AISI)#

The UK AI Security Institute's July 2026 study (empirical) adds a confound the doubling curve above does not name: a model's estimated time horizon, and the rate at which that horizon doubles, both depend on the compute budget the evaluation allows. A horizon number is only defined relative to a budget.

On AISI's narrow cyber CTF suite, frontier time horizons have doubled every 4.7 months since late 2024 when measured at 2.5M tokens/task, close to METR's ~4-month figure on a different (general) basket. But re-fit the same suite at a larger budget and the curve steepens:

The fitted frontier trend is ~60% steeper at 50M tokens than at 2.5M tokens per task. AISI's own gloss: "the estimated doubling rate is partly a consequence of the compute budget used in the evaluation, not a fixed property of frontier cyber progress."
At the model level, one recent frontier model's 80% horizon rose from ~40 minutes at 2.5M tokens to ~4 hours at 50M; at the current frontier, raising the budget 2.5M→50M lifts the estimated horizon from ~2 hours to ~14 hours.
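
The figures above can be turned into two derived quantities. This is a rough sketch under stated assumptions: it reads "~60% steeper" as the slope of log2(horizon) against time being 1.6 times larger, and it summarises each horizon lift as an elasticity (the change in log horizon per change in log budget). Both readings, and the function names, are illustrative interpretations rather than numbers AISI reports.

```python
import math


def doubling_time_after_steepening(base_doubling_months: float, steepening: float) -> float:
    """Doubling time when the log2(horizon)-vs-time slope is (1 + steepening) times larger.

    Assumption: 'X% steeper' refers to the slope of the log-scale trend line.
    """
    return base_doubling_months / (1.0 + steepening)


def horizon_elasticity(h_low: float, h_high: float, b_low: float, b_high: float) -> float:
    """d log(horizon) / d log(budget) between two (budget, horizon) measurements."""
    return math.log(h_high / h_low) / math.log(b_high / b_low)


# 4.7-month doubling at 2.5M tokens/task, trend ~60% steeper at 50M tokens/task.
print(f"implied doubling at 50M: {doubling_time_after_steepening(4.7, 0.6):.1f} months")

# Horizons in minutes at 2.5M vs 50M tokens per task, as quoted above.
recent_model = horizon_elasticity(40, 4 * 60, 2.5e6, 50e6)
frontier = horizon_elasticity(2 * 60, 14 * 60, 2.5e6, 50e6)
print(f"elasticity, recent model: {recent_model:.2f}; frontier: {frontier:.2f}")
```

An elasticity below 1 would mean the horizon grows more slowly than the budget, so a 20x budget increase buys less than a 20x longer horizon in these two examples.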

The mechanism is a second AISI result: the compute an agent needs scales with how long a task takes a skilled human — a power law with fitted exponent ~0.7–1.0 across AISI's 78 cyber CTFs and METR's 211 software-engineering tasks (a minute-task ≈ thousands of tokens, an hour ≈ millions, a week ≈ billions). It holds even for the cheapest successful run per task, so the floor is set by the work the task requires, not by inefficiency. Because longer tasks demand more compute, a fixed budget runs out on the longest tasks first — so a capped evaluation systematically understates the horizon, and a failure on a long task may mean the run was under-budgeted, not that the model lacked the capability. AISI's cyber range "The Last Ones" (~20 human-hours) went unsolved by every model until the budget reached ≥30M tokens; on Epoch's MirrorCode, a recent model spent up to 1 billion tokens to make progress (weeks of human work) that prior models could not.
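
The truncation effect described above can be sketched numerically. The example assumes compute demand follows tokens = c · (human hours)^k with k in the reported 0.7–1.0 range, and it pins the constant c with a single anchor: "The Last Ones" at ~20 human-hours and ~30M tokens. Treating that one threshold as a point on the fitted line is an assumption made here for illustration; the page does not give AISI's fitted constants, and the function names are hypothetical.

```python
def tokens_needed(human_hours: float, exponent: float,
                  ref_hours: float = 20.0, ref_tokens: float = 30e6) -> float:
    """Tokens a task of `human_hours` would need under the assumed power law,
    anchored so that ref_hours costs ref_tokens (illustrative anchor)."""
    return ref_tokens * (human_hours / ref_hours) ** exponent


def max_affordable_hours(budget_tokens: float, exponent: float,
                         ref_hours: float = 20.0, ref_tokens: float = 30e6) -> float:
    """Longest task (human hours) whose assumed token demand fits in the budget."""
    return ref_hours * (budget_tokens / ref_tokens) ** (1.0 / exponent)


for exponent in (0.7, 1.0):
    for budget in (2.5e6, 50e6):
        hours = max_affordable_hours(budget, exponent)
        print(f"k={exponent}: budget {budget / 1e6:.1f}M tokens -> tasks up to ~{hours:.1f} human-hours")
```

Whatever the model's underlying capability, tasks longer than the affordable length fail in this toy picture simply because the run ends first, which is the sense in which a capped evaluation understates the horizon.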

This reframes this page's central number. The ~4-month doubling is real, but it is a statistic at an implicit budget; measured at a larger budget the frontier appears to move faster still. Whether the curve is a stable exponential is now entangled with a third question — at what budget? — on top of the exponential-vs-S-curve one below.

Benchmark saturation as the corroborating signal#

The same pattern appears as benchmarks going from near-zero to "saturated" (≈100%, allowing for errors that cap many benchmarks below 100%):

SWE-bench — hands a model a real open-source codebase + bug report and asks for a change that passes the project's own tests. Low single digits → saturated in two years. (Cf. Claude Opus 4.8: 88.6 on SWE-bench Verified.)
CORE-Bench — reproduce a published paper's results from its code and data; a prerequisite for conducting original research. ~20% (2024) → saturated in fifteen months.

Saturation is why time-horizon length, not single-benchmark accuracy, has become the more informative capability axis — and why Anthropic retired its task-based AI-R&D benchmarks once models crossed the top human baselines (see AI R&D Autonomy Evaluation (AECI)).

Caveats#

Infrastructure strain is a leading indicator, not just trivia. GitHub saw ~1B commits in all of 2025; by mid-2026 it saw ~275M/week (~14B/year pace) and is "pushing incredibly hard" on capacity — a downstream signature of the same throughput surge.
Time-horizon numbers are a 50%-reliability statistic on a basket of tasks; the jaggedness within the basket is real — a model that handles a 12-hour task can still fail a trivial one.
The generational curve is jagged too. AISI's aggregate "newer models reach further, more reliably, more efficiently" hides a minority regression: on roughly 10–30% of tasks (suite-dependent), a newer model actually does worse than its predecessor (AISI footnote). The reach/reliability/efficiency gains are real on average and jagged underneath — jaggedness across generations, not just within a basket.
Whether the curve is a true exponential or an S-curve approaching its bend is the explicit uncertainty of Recursive Self-Improvement's first future.
The release cycle is now shorter than the time to measure the ceiling. Noam Brown (OpenAI, practitioner-opinion): the only way to truly evaluate an agent on a months-long task is to run it that long — but a new model ships every two-to-three months, so a model is retired before anyone has run it long enough to find its ceiling ("nobody actually knows what the ceiling of capabilities are… nobody's run them long enough"). When a long-horizon agent capability shipped, people only realized it mattered a week later, once the first week-long runs finished. The measured horizon lags the true one by structural design — the capability overhang seen from the evaluation side, and the same reason large-scale test-time compute is hard to benchmark to plateau.

Connections#

Recursive Self-Improvement — this curve, extrapolated, is the quantitative case that the loop could close soon
AI Accelerating AI Development — the internal-throughput companion to this external-benchmark evidence
Jagged Intelligence (Ghosts, Not Animals) — the within-basket caveat: long-horizon competence coexists with trivial failures
The Bitter Lesson — rising capability on general benchmarks is what makes hand-built scaffolding a shrinking advantage
AI R&D Autonomy Evaluation (AECI) — why saturated task-based benchmarks were retired from RSP determinations
Build for the Next Model — the forecastable capability curve this measures is what makes "bet on the next release" a rational product strategy rather than a gamble
Autonomous Scientific Discovery — Mythos 5's week-long autonomous genomics run is a concrete long-horizon datapoint past Mythos Preview's measured 16h ceiling
Effective Compute Scaling — the compute-side curve this capability-side trendline complements; Whitfill et al. model time-horizon growth under compute projections
AGI-to-ASI Pathways — a concrete capability trendline feeding the report's quantitative-forecasting and benchmarking-beyond-human agenda
Intelligence Explosion Dynamics — the metric that makes "is the curve bending toward a singularity, or S-curving?" empirically checkable
Deep Research Agents — deep research is a long-horizon autonomous task of exactly the kind this metric measures; DRACO grades the report quality at that horizon
DRACO Benchmark — a sibling capability benchmark (quality of agentic research reports vs. the task length a model sustains); both face benchmark-saturation pressure (DRACO discards >90%-solved tasks)
Production-Sourced Evaluation — the refresh-from-live-usage method that answers this page's open question of what replaces a saturated task basket
Repository Exploration Subagent — the SWE-bench family (Multilingual / Pro / Verified / SWE-QA) is FastContext's evaluation surface; the longest-horizon variant (SWE-bench Pro) shows the largest gains, consistent with exploration cost compounding over task horizon
Planning / Execution Division of Labor — the ceiling (what models can do autonomously, measured here) vs. the realized autonomy users actually grant in practice; Anthropic cites METR's horizons as the rising ceiling its usage data sits below
Agentic Coding Work-Composition Shift — the rising reliable-task-length ceiling is the upstream cause of usage moving from debugging toward operating/analyzing whole workflows end-to-end
Conversation-to-Delegation Shift — the usage-side reading of this ceiling: OpenAI's Codex study finds the share of individual users delegating a task estimated at >8 experienced-human-hours rose 2.1%→25.6% since Dec 2025 — task complexity climbing right under METR's measured horizon
Parallel Agent Orchestration — the long-running-agent runtime margin (p99 OpenAI users ~71 agent-hours/day) is single-agent task duration sitting below this reliable-length ceiling
Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated — theoretical exposure (what an LLM could do) is bounded above by this reliable-task-length ceiling; the survey's reported/anticipated exposure sit below it
The Three Loops of AI-Native Building — one anecdotal point on the curve: Andrew Ng's coding agent worked unattended "for around an hour, using a web browser to check what it had built" before returning to him
Large-Scale Test-Time Compute — the reliable-task-length curve is that thesis measured as a capability; scaffolding models for weeks/months is where the inference budget is spent
Latent Capability Overhang — the ceiling nobody measures: the release cadence is shorter than the time to run a model to its limit
Compute-Controlled Benchmarking — reliable task length at a stated budget is a compute-controlled metric; both replace single-number accuracy and both fight benchmark saturation
Benchmark Score Redundancy — saturation as redundancy: a saturated benchmark has near-zero score spread across models, which is exactly what makes a benchmark trivially predictable in BenchPress's rank-2 matrix — so saturation both erodes this page's metric and makes the score inferable from others
Measuring Beyond Accuracy Saturation — the "what to do after CORE-Bench saturates" companion: this page cites CORE-Bench as saturating in 15 months; Nadgir et al. take exactly that saturated benchmark and show six non-accuracy axes (construct validity, OOD robustness, efficiency, reliability, model-vs-scaffold contribution, human-agent uplift) still discriminate agents, arguing re-instrument, don't retire
UK AI Security Institute — the evaluator that measured horizon-and-doubling-rate budget-dependence and the compute-demand–human-time power law; reuses METR's task set
Noam Brown — source of the "the only way to evaluate a year-long agent is to run it for a year" point
The Open-Weight Frontier Gap — Arena Elo measures chat preference; whether the 33-Elo open/closed gap holds on long-horizon agentic work is a time-horizon question, not a preference one

Open Questions#

Is the 4-month doubling a stable regime or a local steepening? The trend's shape (exponential vs S-curve) is undetermined. Sharpened (2026-07): AISI adds that the doubling rate itself is budget-dependent — the same cyber suite doubles ~60% faster measured at 50M than at 2.5M tokens/task — so the headline rate is undefined without naming the eval budget. The stability question is now entangled with a budget question, not just an exponential-vs-S-curve one.
Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks become measurable — and who builds those tasks? Partially answered / reframed (2026-07): Nadgir et al. argue don't replace — re-instrument: they take CORE-Bench (cited above as saturating in 15 months) and show it still discriminates agents along six non-accuracy axes after accuracy saturates, so "what replaces a saturated basket" can be "keep it and measure differently" rather than "build a harder one." This addresses accuracy saturation, not the length-metric saturation this page's basket faces, and does not answer who builds the next weeks-long tasks — so it reframes the retire reflex without closing the question.

Sources#

When AI builds itself — §"Evidence from the outside world" (METR time horizons; SWE-bench / CORE-Bench saturation; GitHub commit-volume footnote)
Claude Fable 5 and Claude Mythos 5 — "work autonomously for longer than any previous Claude models"; week-long autonomous genomics
Really Big Test-Time Compute in AI Changes Benchmarks, Safety and Research with OpenAI's Noam Brown — Noam Brown (No Priors, 2026-06-26), practitioner-opinion: the release cycle is shorter than the time to run a model to its capability ceiling, so the ceiling is never measured
More compute, more capability: Why AI agent evaluations need to account for test-time compute — UK AISI (2026-07-02, empirical): the 80% cyber time horizon and its 4.7-month doubling are both budget-dependent (~60% steeper at 50M vs 2.5M tokens; 40min→4hr and 2hr→14hr under budget increases); the compute-demand–human-time power law (exponent ~0.7–1.0) over METR's 211 SWE tasks + AISI's 78 cyber CTFs; "The Last Ones" (~20h) needs ≥30M tokens; MirrorCode's 1B-token run


About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 34

Agentic Coding Work-Composition Shift
Anthropic's 400K-session telemetry, Oct 2025→Apr 2026: as models improved, the share of sessions fixing broken code fel…
AGI-to-ASI Pathways
DeepMind's four non-exclusive, parallel technological routes from human-level AGI to superintelligence — scaling, algor…
AI Accelerating AI Development
The empirical core of *When AI builds itself*: measured evidence AI already speeds AI R&D at Anthropic — >80% of merged…
AI R&D Autonomy Evaluation (AECI)
How Anthropic measures whether a model can automate or dramatically accelerate AI research — the capability that drives…
Anthropic Institute
Anthropic's policy/governance research arm; published *When AI builds itself* (Favaro & Clark, 2026) on recursive self-…
Autonomous Scientific Discovery
Mythos-class models now conduct novel science with limited human input — autonomous protein/drug design (~10× faster, m…
Benchmark Score Redundancy
Zeng & Papailiopoulos (Microsoft Research, arXiv 2606.24020): an 84-model × 133-benchmark public score matrix (2,604 ce…
How Much Signal Do Public Benchmarks Still Carry — and What Replaces Them?
Synthesis of the 2026 eval-science cluster: public benchmark suites carry far less independent signal than their count…
Build for the Next Model
Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…
Claude Fable 5
Anthropic's first generally-available Mythos-class model (June 2026) — state-of-the-art on nearly all benchmarks; the s…
Compute-Controlled Benchmarking
Noam Brown's critique that the single-number 'benchmark grid' is broken because it doesn't control for test-time comput…
Conversation-to-Delegation Shift
OpenAI's Codex usage study (June 2026): the move from conversational AI ('asking') to agentic AI ('delegated production…
Deep Research Agents
Agentic systems that decompose a complex query, iteratively search diverse sources, and synthesize a structured, cited…
DRACO Benchmark
Perplexity's benchmark of 100 production-sourced deep-research tasks (10 domains, 40 countries) graded by 26-expert rub…
Effective Compute Scaling
DeepMind's framing of compute growth as ~10×/year of 'effective compute' — the product of hardware improvement (~1.5×/y…
Exposure Taxonomy: Observed, Theoretical, Reported, Anticipated
Four distinct ways to measure AI's reach into an occupation — observed exposure (tasks seen done with Claude), theoreti…
Intelligence Explosion Dynamics
The growth-curve question behind recursive self-improvement: whether AI-accelerating-AI produces exponential, super-exp…
Jagged Intelligence (Ghosts, Not Animals)
"Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…
Large-Scale Test-Time Compute
Noam Brown's thesis that model capability is now a function of inference budget (tokens/cost/time): with good scaffoldi…
Latent Capability Overhang
Noam Brown's claim that already-released models can do far more than anyone has extracted, because nobody spends enough…
Measuring Beyond Accuracy Saturation
Nadgir, Kapoor, … Narayanan (Princeton-led, 14 authors, arXiv 2606.26158): when a benchmark's accuracy saturates (top a…
METR
Independent AI-evaluation org behind the 'time horizons' benchmark — the task length a model can complete reliably on i…
Evals & Benchmarks
Map of Content for the evals-and-benchmarks domain — 11 concepts. The science of measuring models: benchmark validity,…
Mythos Model
Anthropic preview-tier frontier model and the first member of the Mythos-class tier (above Opus); gated for safety, use…
Open Questions Backlog
_396 actionable open questions across 155 pages · 79 predictions · 9 notes · 21 in progress · 59 watching (entities), a…
The Open-Weight Frontier Gap
Arena Text, June 2026: the top closed model leads the best open model by 33 Elo and the best *dense* open model by 57;…
Parallel Agent Orchestration
OpenAI Codex study's concurrency + runtime margins: the intensive-user workflow where a human oversees a team of agents…
Planning / Execution Division of Labor
Anthropic's 400K-session telemetry: in a typical Claude Code session humans make ~70% of planning decisions (what to do…
Production-Sourced Evaluation
Building benchmarks from de-identified real production usage rather than synthetic or hand-authored tasks; DRACO's cent…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
Repository Exploration Subagent
FastContext's thesis that repository exploration (read/search/localization) should be decoupled from solving into a ded…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
The Three Loops of AI-Native Building
Andrew Ng's nested-loop taxonomy for 0-to-1 products: the agentic coding loop (minutes, agent-closed), the developer fe…
UK AI Security Institute
UK government AI-evaluation body (Science of Evaluation team); its July 2026 test-time-compute study is the first indep…

Open Questions Backlog
_396 actionable open questions across 155 pages · 79 predictions · 9 notes · 21 in progress · 59 watching (entities), a…
Responsible Scaling Policy Evaluations
Anthropic's RSP gates deployment on pre-release capability evaluations in CBRN, automated AI R&D, and high-stakes misal…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Anthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
Recursive Self-Improvement
An AI system autonomously designing and developing its own successor; Anthropic Institute's *When AI builds itself* arg…
