Summary#
This page covers the evaluation cluster that measures whether a model can automate or dramatically accelerate AI research and development. Taken far enough, that capability would enable recursive self-improvement, which makes it the load-bearing input to the RSP automated-AI-R&D threat model. For Opus 4.8 the determination is that it does not cross the automated AI-R&D capability threshold: it sits between Opus 4.7 and Mythos Preview on the measured axes, does not advance the frontier, and, most importantly per Anthropic, "does not seem close to being able to substitute for Research Scientists and Research Engineers, especially relatively senior ones."
How it's measured#
AECI — the capability index#
The Anthropic ECI (AECI) is a fork of Epoch AI's Epoch Capability Index, used to track the rate of capability improvement over time. A slope-ratio analysis on the frontier models estimates how fast capability is rising. For Opus 4.8 (computed on a smaller n=11 evaluation set):
- Opus 4.8: 155.5, between Opus 4.7 (154.1) and Mythos Preview (158.3).
Because the slope-ratio analysis is computed on frontier models only and Opus 4.8 is a non-frontier point, adding it leaves the trajectory unchanged from the Mythos Preview System Card.
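The reasoning above can be illustrated with a minimal Python sketch. It is not Anthropic's slope-ratio method, which this page does not specify. It only shows why a point that sits below the running maximum cannot change a fit restricted to frontier points, and where 155.5 falls between its two neighbors. The release order, the `frontier` helper, and the `relative_position` helper are assumptions made for illustration.

```python
# AECI scores quoted on this page. The release order is an assumption
# made for illustration (Opus 4.7, then Mythos Preview, then Opus 4.8).
SCORES = [
    ("Opus 4.7", 154.1),
    ("Mythos Preview", 158.3),
    ("Opus 4.8", 155.5),
]


def frontier(points):
    """Return the points that set a new running maximum, in release order."""
    best = float("-inf")
    result = []
    for name, score in points:
        if score > best:
            result.append((name, score))
            best = score
    return result


def relative_position(low, mid, high):
    """Fraction of the way from `low` to `high` at which `mid` sits."""
    return (mid - low) / (high - low)


# Frontier before and after adding Opus 4.8 (hypothetical comparison).
without_opus_48 = frontier(SCORES[:2])
with_opus_48 = frontier(SCORES)
same_frontier = without_opus_48 == with_opus_48

# Where 155.5 falls in the interval from 154.1 to 158.3.
position = relative_position(154.1, 155.5, 158.3)
```

Under these assumptions `same_frontier` is true, so any trend fitted only to frontier points is untouched by the new score, and `position` comes out at about one third of the way from Opus 4.7 to Mythos Preview.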
The two-pronged threshold#
From the RSP, the AI-R&D threshold is met if either: (1) models can fully substitute for Anthropic's entire set of Research Scientists/Engineers at competitive cost (within 5×), or (2) there is "dramatic acceleration" of AI progress attributable to automation. Anthropic determined for Mythos Preview that neither holds — no sustained AI-attributable 2× acceleration, and no closeness to substituting for senior research staff — and both conclusions carry over to Opus 4.8.
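The either/or structure of the threshold can be written as a small predicate. This is a sketch of the logic as summarized on this page, not the RSP's own wording. The `Evidence` fields and the function name are hypothetical, and using 2× as the bar for "dramatic acceleration" is an assumption drawn from the "no sustained AI-attributable 2× acceleration" finding above.

```python
from dataclasses import dataclass

COST_LIMIT = 5.0          # "within 5×" competitive-cost bound from the text
ACCELERATION_LIMIT = 2.0  # assumption: 2× stands in for "dramatic acceleration"


@dataclass
class Evidence:
    """Hypothetical summary of the inputs to the determination."""
    full_substitution: bool        # can models replace the entire research staff?
    cost_ratio: float              # model cost divided by human cost for that work
    sustained_acceleration: float  # AI-attributable speedup of AI progress


def threshold_met(e: Evidence) -> bool:
    """True if either prong of the AI-R&D threshold holds."""
    prong_substitution = e.full_substitution and e.cost_ratio <= COST_LIMIT
    prong_acceleration = e.sustained_acceleration >= ACCELERATION_LIMIT
    return prong_substitution or prong_acceleration
```

With illustrative inputs such as `Evidence(full_substitution=False, cost_ratio=1.0, sustained_acceleration=1.2)`, neither prong holds and the predicate returns `False`, which matches the shape of the determination described for Mythos Preview and Opus 4.8. The real judgment rests on qualitative evidence that a boolean field cannot capture.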
Concrete shortcomings vs. human researchers#
Rather than rely on benchmark scores alone, the card collects observable failures from day-to-day internal pre-release use (§2.3.3): examples of fabrication, ignoring correction, skipping cheap verification, and instruction-following failures. These behavioral examples — not just scores — anchor the "not close to substituting" determination. (They also overlap with the Agentic Honesty & Diligence failure modes, observed here in a research-engineering setting.)
Why task-based AI-R&D benchmarks were retired#
Recent models have crossed the highest human baselines on many automated task-based AI-R&D evaluations, so those tasks are no longer load-bearing for RSP threshold determinations and are no longer reported. Anthropic is shifting toward direct measurement of AI R&D acceleration and researcher uplift — i.e., measuring the real-world speedup rather than proxy task scores.
The recursive-self-improvement link#
This is the capability-side gate on Recursive Self-Improvement: AECI and the substitution threshold are how Anthropic asks "can the model build the next model?" The deployment-side correlate, how much AI is already accelerating Anthropic's own work, is documented in the Anthropic Institute essay When AI builds itself and compiled here as AI Accelerating AI Development. That page reports more than 80% of merged code authored by Claude, roughly 8× the code per engineer per day relative to 2024, and a kernel-optimization eval going from 3× to 52× in a year. The two are complementary: AECI gates the capability, while AI Accelerating AI Development measures the acceleration already underway. Both describe the same persistent gap, judgment in choosing goals (Research Taste as the Human Bottleneck), which is also the axis the "not close to substituting for senior researchers" determination turns on.
Connections#
- Recursive Self-Improvement — AECI is the capability-side gate on whether the model can build its successor
- AI Accelerating AI Development — the deployment-side correlate the System Card anticipated, now compiled from When AI builds itself
- Research Taste as the Human Bottleneck — "not close to substituting for senior researchers" is the formal version of "taste/judgment is still the human gap"
- Task Time-Horizon Scaling — the saturating task-based benchmarks (and the time-horizon curve that outran them) are why Anthropic is shifting toward direct acceleration measurement
- Responsible Scaling Policy Evaluations — AECI feeds the RSP automated-AI-R&D threat-model determination
- Claude Opus 4.8 — the model assessed; AECI 155.5, below the frontier, not close to substituting for researchers
- Mythos Model — the frontier-setting model; its System Card holds the full methodology and bounds the Opus 4.8 case
- Agentic Honesty & Diligence — the fabrication / ignored-correction / skipped-verification shortcomings are the same alignment failure modes seen in coding evals, here in a research setting
- The Bitter Lesson — the acceleration AECI tracks is what makes "scaled general methods improve themselves" more than a slogan
- Harness Shrinkage as Models Improve — the deployment-side correlate: as the model absorbs more capability, internal engineering accelerates (the recursive-self-improvement throughput story)
- Autonomous Scientific Discovery — adjacent autonomy in a non-AI science domain (a model designing and training a model that beats a published baseline), though gated below the AI-R&D substitution threshold this page measures
Open questions#
- "Not close to substituting for senior researchers" is a subjective, internally sourced judgment. What objective signal would replace it as models approach the threshold?
- AECI is a single scalar fork of an external index; how sensitive is the 155.5 / frontier-not-advanced conclusion to the choice of the n=11 evaluation set?
- The shift to "direct measurement of AI R&D acceleration and researcher uplift" is announced but not yet operationalized in this card — what does that measurement look like?
Sources#
- Claude Opus 4.8 System Card — §2.3 (AI R&D): §2.3.1 autonomy evaluations, §2.3.3 shortcomings vs. human researchers, §2.3.4 AECI capability trajectory, §2.3.5 conclusion
