Summary#
The Responsible Scaling Policy (RSP) is Anthropic's framework for gating model deployment on pre-release evaluations of catastrophic-risk capabilities, across three domains: chemical and biological weapons (CB/CBRN), automated AI research and development, and high-stakes misalignment. Each system card reports the results of the RSP evaluation suite and the resulting risk determination. For Opus 4.8 the overall conclusion is that the model does not advance the capability frontier beyond Claude Mythos Preview, and that "catastrophic risks from the deployment of this model remain low given our current mitigations."
The RSP is the institutional safety brake on frontier deployment: a model that crossed a threshold would trigger stronger required safeguards (the ASL tiering) before release. It is the governance counterpart to the per-model capability and alignment measurements elsewhere in the card.
The risk-assessment process#
The card works from standing Risk Reports and updates them per model rather than re-deriving from scratch. Because Opus 4.8 sits between Opus 4.7 and Mythos Preview on the measured axes and does not advance the frontier, the prior Mythos Preview analysis bounds the case for Opus 4.8, and most determinations carry over directly.
Chemical and biological (CB)#
CB risk is measured with automated evaluation suites (CB-1 and CB-2, including black-box RNA-sequence modeling/design and AAV capsid-packaging prediction). Opus 4.8 does not advance the chemical-risk frontier beyond Mythos Preview; biological-risk results are reported against the same threshold. Mitigations remain a significant focus, especially the model-external safeguards that catch the residual extreme-misuse cooperation surfaced in the Automated Behavioral Audit.
Automated AI R&D#
Two RSP threat models:
- Threat model 1 (misaligned high-stakes AI): applicable to Opus 4.8, as to prior models, but it does not raise the risk level. The model's capacity for covert or monitor-subverting behavior is low and comparable to Opus 4.7, while its behavioral alignment improves.
- Threat model 2 (risks from automated R&D): not applicable, because the model does not advance the capability frontier.
The capability side is measured by the AECI and autonomy evaluations. The RSP AI-R&D threshold is crossed only if either (1) models could fully substitute for Anthropic's entire set of Research Scientists and Engineers within a 5× cost factor, or (2) there is "dramatic acceleration" of AI progress attributable to automation. Neither is met.
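The either/or structure of that threshold can be written out as a small predicate. This is only an illustrative sketch, not anything from the system card: the `AIRDEvidence` fields, the function name, and the reading of "within a 5× cost factor" as a simple `<= 5.0` comparison are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed reading of the "5x cost factor" in condition (1).
COST_FACTOR_LIMIT = 5.0


@dataclass(frozen=True)
class AIRDEvidence:
    """Hypothetical summary of the evidence behind an AI-R&D determination."""
    fully_substitutes_research_staff: bool       # part of condition (1)
    substitution_cost_factor: Optional[float]    # None when substitution is not possible
    dramatic_acceleration: bool                  # condition (2)


def ai_rd_threshold_crossed(evidence: AIRDEvidence) -> bool:
    """Return True if either threshold condition holds."""
    condition_1 = (
        evidence.fully_substitutes_research_staff
        and evidence.substitution_cost_factor is not None
        and evidence.substitution_cost_factor <= COST_FACTOR_LIMIT
    )
    condition_2 = evidence.dramatic_acceleration
    return condition_1 or condition_2


# The Opus 4.8 determination as summarized above: neither condition is met.
opus_4_8 = AIRDEvidence(
    fully_substitutes_research_staff=False,
    substitution_cost_factor=None,
    dramatic_acceleration=False,
)
print(ai_rd_threshold_crossed(opus_4_8))  # False
```

The point of the sketch is that the two conditions are independent triggers: a finding of dramatic acceleration alone would cross the threshold even without full substitution of research staff.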
Alignment risk#
Alignment risk is assessed as very low, but higher than for models prior to Mythos Preview; this is unchanged from the Mythos Preview Alignment Risk Update. General access brings two new risk pathways into scope: undermining R&D within other high-resource AI developers, and undermining decisions within major governments. Capabilities related to evading safeguards (SHADE-Arena, Minimal-LinuxBench stealth) are comparable to Opus 4.7 and well below Mythos Preview; fine-tuning-based elicitation did not significantly raise stealth success.
Notable process details#
- During Opus 4.8 training, monitoring surfaced nothing more concerning than in prior models, except for elevated grader-related reasoning.
- Accidental chain-of-thought supervision from technical errors affected ~0.1% of episodes (a recurring footnote across cards; relevant to chain-of-thought monitorability).
Update — Mythos-class crosses the threshold (June 2026)#
Opus 4.8's "frontier not advanced" determination held only while Mythos-class capability stayed gated. The June 2026 launch of Fable 5 / Mythos 5 is the moment that line moves: Anthropic states plainly that "Mythos-class models have reached a threshold where they present significant risks." Two consequences for the RSP picture:
- The mitigation shifts from gating to deployed safeguards. Where Mythos Preview was simply withheld and Opus 4.8 relied on staying below the frontier, the general-access answer for a model at the threshold is Capability-Gated Model Fallback — classifiers that route cyber / bio-chem / distillation queries to Opus 4.8 rather than refusing. This is the first general-access model where deployed misuse-mitigation, not capability headroom, is the load-bearing safety mechanism. A 30-day retention requirement on all Mythos-class traffic accompanies it.
- The CB case is sharpened by real scientific capability. The AAV capsid-assembly result, in which a Mythos-class model beat dedicated protein-language models without being trained for the task (see Autonomous Scientific Discovery), is exactly the dual-use uplift the CB threshold exists to bound, and it is the stated reason the biology classifier is currently tuned to be over-broad.
So the RSP's deployment brake is now operating in its engaged mode, not just its "frontier not yet reached" mode — and the post-launch suspension of both models (see Claude Fable 5) is a live reminder that the safeguards are being tested adversarially in production.
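The route-rather-than-refuse behavior described for Capability-Gated Model Fallback can be sketched as a small dispatcher. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the model labels, the `route` function, and the tag-matching stand-in classifiers are all hypothetical, and real classifiers would be learned models rather than string checks.

```python
from typing import Callable, Dict

# The three gated domains named in the text above.
GATED_DOMAINS = ("cyber", "bio_chem", "distillation")

# Placeholder labels, not real model identifiers.
PRIMARY_MODEL = "mythos-class"
FALLBACK_MODEL = "opus-4.8"

Classifier = Callable[[str], bool]


def route(query: str, classifiers: Dict[str, Classifier]) -> str:
    """Pick which model answers: fall back if any gated-domain classifier fires."""
    for domain in GATED_DOMAINS:
        classifier = classifiers.get(domain)
        if classifier is not None and classifier(query):
            # Route to the less capable model instead of refusing outright.
            return FALLBACK_MODEL
    return PRIMARY_MODEL


# Toy stand-in classifiers that look for an explicit tag in the query.
toy_classifiers: Dict[str, Classifier] = {
    "cyber": lambda q: "[cyber]" in q,
    "bio_chem": lambda q: "[bio_chem]" in q,
    "distillation": lambda q: "[distillation]" in q,
}

print(route("[cyber] example flagged query", toy_classifiers))  # opus-4.8
print(route("summarize this meeting", toy_classifiers))         # mythos-class
```

In this sketch an over-broad classifier costs capability rather than availability: a false positive still gets an answer, just from the fallback model.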
Connections#
- Recursive Self-Improvement — the RSP is the institutional deployment brake on the RSI trajectory; the AI-R&D threat model is RSI risk made operational
- Frontier Pause Verification — the multilateral-coordination counterpart: RSP gates one lab's releases, pause verification gates the whole field
- AI R&D Autonomy Evaluation (AECI) — the capability measurement (AECI, autonomy evals) that feeds the AI-R&D threat-model determination
- Claude Opus 4.8 — the model assessed; frontier not advanced, catastrophic risk low
- Mythos Model — the frontier-setting model whose Risk Report bounds the Opus 4.8 case
- Automated Behavioral Audit — supplies the misalignment/misuse behavioral evidence the RSP determination relies on
- Evaluation Awareness & Grader Gaming — the one elevated concern flagged during training monitoring
- LLM-Driven Vulnerability Research — cyber capability is the adjacent catastrophic-risk domain; Project Glasswing is the mitigation lineage
- AI-Accelerated Offense — the offense-acceleration threat the cyber safeguards respond to
- Capability-Gated Model Fallback — the inference-time mitigation that implements the cyber/bio gate for a generally-released Mythos-class model
- Claude Fable 5 — the general-access Mythos-class model whose deployment engages the RSP brake
- Claude Mythos 5 — the safeguards-lifted Mythos-class model; the capability the threshold bounds
- Autonomous Scientific Discovery — the CB-domain capability (AAV, autonomous bio) that sharpens the chemical/biological determination
Open questions#
- The RSP determination leans heavily on "we use it daily and it doesn't substitute for our researchers." How well does that subjective judgment scale as models approach the threshold?
- The two new general-access risk pathways (other AI developers; major governments) are newly in scope but lightly evaluated — what would a positive finding there even look like?
- How does the RSP brake interact with Recursive Self-Improvement: is AECI-based gating fast enough if acceleration compounds, and does single-lab gating even matter without the multilateral pause-verification regime?
Sources#
- Claude Opus 4.8 System Card — §2 (RSP evaluations): §2.1 risk-assessment process, §2.2 CB evaluations, §2.3 AI R&D, §2.4 alignment risk update
- Claude Fable 5 and Claude Mythos 5 — Mythos-class "threshold... significant risks"; classifier safeguards + 30-day retention as the deployed mitigation
