Howardism | METR

Sources

When AI builds itself

Summary

METR (Model Evaluation & Threat Research) is an independent organization that evaluates frontier-AI capabilities, best known for its time-horizons measurement: the length of task a model can complete reliably on its own. Its data is the external-benchmark backbone of the Anthropic Institute's When AI builds itself essay and anchors this wiki's Task Time-Horizon Scaling page.

What it does

Time horizons. METR reports the task duration at which a model succeeds 50% of the time across a basket of tasks (the trend holds at 80% reliability too). Its headline finding is that this horizon now doubles roughly every four months, faster than the earlier ~seven-month doubling time: the quantitative case that capability is accelerating, not merely improving.
Long-task measurement at the frontier. METR found Claude Mythos Preview could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks" — i.e. the frontier model has begun to outrun the benchmark's own ceiling.
Independent third-party signal. Because METR sits outside the labs, its numbers function as external corroboration of internal acceleration claims like Anthropic's ~8× code-throughput figure (AI Accelerating AI Development).
Reused by other evaluators. The UK AI Security Institute's July 2026 test-time-compute study runs on METR's 211-task software-engineering set (alongside AISI's own cyber tasks) and extends the horizon framing by showing the horizon — and its doubling rate — is budget-dependent (see Task Time-Horizon Scaling).
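
The doubling claim above can be turned into rough arithmetic. The sketch below is an illustration only, not METR's methodology: it assumes clean exponential growth, treats the "at least 16 hours" figure as a point estimate rather than the lower bound it is, and uses a hypothetical 40-hour target (one working week). The function names are made up for this page.

```python
import math


def horizon_after(h0_hours: float, months: float, doubling_months: float) -> float:
    """Horizon after `months`, assuming pure exponential growth (an assumption)."""
    return h0_hours * 2 ** (months / doubling_months)


def months_to_reach(h0_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months until the horizon reaches `target_hours` under the same assumption."""
    return doubling_months * math.log2(target_hours / h0_hours)


START_HOURS = 16.0  # "at least 16 hours", used here as a point estimate
WEEK_HOURS = 40.0   # hypothetical target: one 40-hour working week

# Compare the ~four-month and ~seven-month doubling times from the text.
for doubling in (4.0, 7.0):
    months = months_to_reach(START_HOURS, WEEK_HOURS, doubling)
    print(f"{doubling:.0f}-month doubling: ~{months:.1f} months to a 40-hour horizon")
# 4-month doubling: ~5.3 months to a 40-hour horizon
# 7-month doubling: ~9.3 months to a 40-hour horizon
```

Under these assumptions, shortening the doubling time from seven months to four cuts the wait for any fixed target by the same 4/7 ratio. Because 16 hours is a floor, the real starting point may be higher and the wait shorter.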

Connections

Task Time-Horizon Scaling — the concept page built on METR's time-horizons metric
AI Accelerating AI Development — METR's external trendline corroborates Anthropic's internal-throughput evidence
Recursive Self-Improvement — the doubling curve, extrapolated, is the quantitative case for RSI arriving sooner than expected
Mythos Model — the model METR rated at "at least 16 hours," beyond its current measurement ceiling
UK AI Security Institute — sibling independent evaluator that reuses METR's task set and shows the horizon metric is budget-dependent
Researcher Uplift from Code Output — a July 2026 modeling note by METR's Thomas Kwa translating Anthropic's 8×-code figure into ~2.5× serial researcher uplift; leans on METR's own uplift RCT for the verbosity and felt-vs-actual-speedup caveats

Open Questions

What new tasks will METR build to measure days- and weeks-long horizons once current baskets saturate?
METR also runs the research showing that developers' self-estimates of AI uplift are overstated. How does it reconcile that skepticism with its own steep time-horizon curve? Sharpened by Researcher Uplift from Code Output: a METR modeler (Kwa) threads exactly this needle. He discounts self-reports (citing METR's finding of a felt +20% speedup against an actual −20%) and flags verbosity, yet still estimates >2× researcher uplift from an objective 8×-code-output figure rather than from self-estimates. In other words, METR's skepticism is specifically about self-report metrics, not about whether the acceleration is real.

Sources

When AI builds itself — cites METR time horizons and METR's Mythos Preview "16 hours / upper end of what we can measure" assessment