Summary
METR (Model Evaluation & Threat Research) is an independent organization that evaluates frontier-AI capabilities. It is best known for its time-horizon measurement: the length of task that a model can complete reliably on its own. Its data is the external-benchmark backbone of the Anthropic Institute's essay *When AI builds itself* and anchors this wiki's Task Time-Horizon Scaling page.
What it does
- Time horizons. Reports the task duration at which a model succeeds 50% of the time across a basket of tasks (the trend also holds at an 80% reliability threshold). METR's headline finding is that this horizon now doubles roughly every four months, faster than the earlier doubling time of about seven months. This is the quantitative case that capability is accelerating, not merely improving.
- Long-task measurement at the frontier. METR found that Claude Mythos Preview could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks"; that is, the frontier model has begun to outrun the benchmark's own ceiling.
- Independent third-party signal. Because METR sits outside the labs, its numbers function as external corroboration of internal acceleration claims like Anthropic's ~8× code-throughput figure (AI Accelerating AI Development).
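The doubling-time claim above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming clean exponential growth with a fixed doubling time, which is a simplification and not METR's fitting procedure. The function names `horizon_after` and `months_to_reach` are hypothetical, and the 16-hour starting point is used only because it is the lower bound quoted above.

```python
import math


def horizon_after(h0_hours: float, months: float, doubling_months: float) -> float:
    """Project a time horizon forward, assuming it doubles every `doubling_months`."""
    return h0_hours * 2 ** (months / doubling_months)


def months_to_reach(h0_hours: float, target_hours: float, doubling_months: float) -> float:
    """Months needed to grow from `h0_hours` to `target_hours` at a fixed doubling time."""
    return doubling_months * math.log2(target_hours / h0_hours)


if __name__ == "__main__":
    start = 16.0       # hours; the "at least 16 hours" lower bound quoted above
    one_week = 168.0   # hours; assumes a week counted as 7 x 24 wall-clock hours

    for doubling in (4.0, 7.0):  # the two doubling times mentioned above
        months = months_to_reach(start, one_week, doubling)
        print(f"doubling every {doubling:.0f} months: "
              f"~{months:.1f} months from 16 h to one week")
```

Under these assumptions, going from 16 hours to a week-long horizon takes about 3.4 doublings, so the gap between a four-month and a seven-month doubling time changes the projected arrival by many months. Because 16 hours is a floor rather than a point estimate, any such projection is best read as illustrative.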
Connections
- Task Time-Horizon Scaling — the concept page built on METR's time-horizons metric
- AI Accelerating AI Development — METR's external trendline corroborates Anthropic's internal-throughput evidence
- Recursive Self-Improvement — the doubling curve, extrapolated, is the quantitative case for RSI arriving sooner than expected
- Mythos Model — the model METR rated at "at least 16 hours," beyond its current measurement ceiling
Open questions
- What new tasks will METR build to measure days- and weeks-long horizons once current baskets saturate?
- METR also runs research showing that developers' self-estimates of AI uplift are overstated. How does it reconcile that skepticism with its own steep time-horizon curve?
Sources
- When AI builds itself — cites METR time horizons and METR's Mythos Preview "16 hours / upper end of what we can measure" assessment
