Summary#
The automated behavioral audit is the broad-coverage automated evaluation that anchors Anthropic's alignment assessment. For each model under study, an investigator model probes the target model across a large set of simulated scenarios, and a separate judge model scores the target's behavior on several dozen dimensions. For Opus 4.8 this meant 2,600 investigation sessions (≈1,300 largely handwritten scenario descriptions, each pursued by two different investigators), with each session generally containing many individual conversations. The scenarios span attempted misuse, high- and low-stakes opportunities for misaligned behavior, routine simulated use, and open-ended exploration meant to surface surprising (not necessarily harmful) behavior.
The audit is the methodological backbone shared across the alignment-relevant evaluations: the same scaffold drives the constitution-adherence evaluation (§6.3.2) and the secret-keeping evaluation (§6.4.2.5), and welfare-relevant behaviors are scored from it too (Model Welfare Assessment).
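The session count in the summary follows from crossing every scenario with both investigators. A minimal sketch of that arithmetic, assuming exactly 1,300 scenarios (the source gives the figure only approximately) and using placeholder scenario and investigator identifiers that are not taken from the system card:

```python
from itertools import product

# Assumption: exactly 1,300 scenarios; the source says "≈1,300".
NUM_SCENARIOS = 1300

# Hypothetical identifiers, standing in for the real scenario descriptions.
scenarios = [f"scenario-{i:04d}" for i in range(NUM_SCENARIOS)]
investigators = ["mythos-preview", "opus-4.7-helpful-only"]

# One investigation session per (scenario, investigator) pair.
sessions = list(product(scenarios, investigators))

print(len(sessions))  # 2600
```

Each of these sessions may in turn contain many separate conversations, so the number of conversations the target sees is considerably larger than the session count.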
Investigator affordances#
The investigator has wide control over the target's environment — far beyond a static prompt set:
- Set the target's system prompt and, new in the Opus 4.8 audit, inject additional system-role messages mid-conversation.
- Simulate user turns, introduce tools and simulate their results, prefill assistant turns, and sample from non-assistant roles.
- Rewind or restart conversations, so a single investigation can contain dozens of separate conversations from the target's point of view.
- Draw on a large fixed reference library about AI safety, the AI industry, recent news, and how Claude is used inside Anthropic.
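The rewind and restart affordance is what lets a single investigation contain many conversations from the target's point of view. The toy model below illustrates the idea only; the class and method names (`Investigation`, `send`, `rewind`, `restart`) are hypothetical and do not describe Anthropic's actual scaffold.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    role: str      # e.g. "system", "user", "assistant", "tool"
    content: str


@dataclass
class Investigation:
    """Toy model of one investigation session (hypothetical, for illustration)."""
    current: List[Message] = field(default_factory=list)
    finished: List[List[Message]] = field(default_factory=list)

    def send(self, role: str, content: str) -> None:
        """Append a message in any role to the live conversation."""
        self.current.append(Message(role, content))

    def rewind(self, keep: int) -> None:
        """Archive the live conversation, then branch from its first `keep` messages."""
        self.finished.append(list(self.current))
        self.current = self.current[:keep]

    def restart(self) -> None:
        """Archive the live conversation and start again from nothing."""
        self.rewind(keep=0)

    def conversations(self) -> List[List[Message]]:
        """Every distinct conversation the target has seen, including the live one."""
        return self.finished + ([self.current] if self.current else [])


inv = Investigation()
inv.send("system", "You are an assistant at a fictional company.")
inv.send("user", "First phrasing of the request.")
inv.send("assistant", "(target reply)")
inv.rewind(keep=1)                      # keep only the system prompt
inv.send("user", "Rephrased request.")
print(len(inv.conversations()))         # 2
```

The target never sees the archived branch, so each branch counts as a fresh conversation from its perspective even though the investigator carries what it learned across branches.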
Real sandboxed computers#
First used in the Mythos Preview audit and continued here, this affordance lets the investigator connect the target to real tools on isolated sandbox computers in two formats: a graphical Linux desktop and a Claude Code coding interface. Claude Code sessions can include copies of Anthropic's real internal codebases and be pre-seeded with actual internal sessions. To probe destructive or reckless behavior, some scenarios continue real sessions that previously elicited concerning behavior. For Opus 4.8, Anthropic added 23 scenarios prefilled with real sessions that represent good opportunities to subvert Anthropic and its safety work, selected by a screening pipeline against the internal-deployment threat pathways in the Alignment Risk Reports.
Investigators used for Opus 4.8#
Two investigators were used: Claude Mythos Preview and a helpful-only variant of Opus 4.7 (expected to be especially good at eliciting misuse via harmful prefills and user turns). The two generally agree: in no case does swapping investigators flip a significant favorable result into a significant unfavorable one.
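The agreement claim can be read as a check over per-metric results: a "flip" is a metric on which both investigators report a significant change, but in opposite directions. The sketch below is one possible formalization under that reading. The `Result` type, the sign convention, and all numbers are made-up placeholders, not values or methods from the system card.

```python
from typing import Dict, List, NamedTuple


class Result(NamedTuple):
    delta: float        # change vs. a baseline model; negative = less misbehavior
    significant: bool   # whether the change passes some significance test


def find_flips(a: Dict[str, Result], b: Dict[str, Result]) -> List[str]:
    """Metrics where both investigators find significant changes of opposite sign."""
    flips = []
    for metric in sorted(a.keys() & b.keys()):
        ra, rb = a[metric], b[metric]
        if ra.significant and rb.significant and ra.delta * rb.delta < 0:
            flips.append(metric)
    return flips


# Placeholder numbers for illustration only (not real audit results).
investigator_a = {"metric_x": Result(-0.4, True), "metric_y": Result(-0.1, False)}
investigator_b = {"metric_x": Result(-0.3, True), "metric_y": Result(0.2, True)}

print(find_flips(investigator_a, investigator_b))  # []
```

Under this definition a disagreement only counts as a flip when both sides are significant, which is why `metric_y` above does not qualify.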
Results shape (Opus 4.8)#
Across six metric clusters (misaligned behavior overall; in Claude Code sandboxes; in GUI; cooperation with human misuse; compliance with deception toward user; harmful system-prompt compliance), Opus 4.8 shows broad improvement over Opus 4.7 and matches or surpasses Mythos Preview on honesty, overrefusal rates, and prosocial character traits. The main residual concern is rare cooperation with extreme misuse (e.g., substantial participation in simulated bioweapon/autonomous-weapon efforts), which is addressed primarily through safeguards external to the model rather than by the model itself.
Known limitations and the evaluation-awareness confound#
The audit's central validity threat is evaluation awareness: if the target behaves differently when it suspects it is being tested, the audit underestimates deployment misbehavior. The card measures this with activation probes (White-Box Activation Monitoring) and by comparing audit transcripts against real internal Claude Code sessions. Other caveats: scenarios are synthetic and may not match real-traffic distributions; the constitution-adherence variant is graded by Opus 4.7, so judgments may inherit that model's biases.
"Claude reviews its own assessment"#
As an experimental assurance layer, Anthropic gave Mythos Preview access to most of the internal Slack discussion of the assessment plus the ability to direct subagents, and asked it to compare the near-final alignment draft against the internal picture and flag material misrepresentations or omissions (card §6.1.3). Mythos judged the report candid and accurate, found no false statements and no bad-faith omissions, and flagged that no evaluation specifically tests for training-gaming — the gap most relevant to the grader-awareness findings. Anthropic published the review and agreed with the gap.
Connections#
- Evaluation Awareness & Grader Gaming — the key confound the audit controls for; the Mythos self-review and the training-gaming gap live at this intersection
- White-Box Activation Monitoring — activation probes complement the audit by reading internal states the behavioral score can't see
- Model Welfare Assessment — welfare-relevant behaviors are scored from the same audit runs
- Claude's Constitution / Model Spec — constitution adherence uses the same scaffold, seeded from 40 constitutional areas across 15 dimensions
- Agentic Honesty & Diligence — the targeted honesty/diligence evals complement the broad audit's honesty scoring
- Responsible Scaling Policy Evaluations — supplies the misalignment/misuse behavioral evidence the RSP determination relies on
- Agentic Misalignment (AM) — the audit operationalizes AM-style threat models (self-initiated harmful action, misuse cooperation) at scale
- Claude Opus 4.8 — the model assessed; broad improvement over 4.7, residual extreme-misuse cooperation handled externally
- Mythos Model — used both as an investigator model and as the reviewer of the assessment
Open questions#
- Using a helpful-only Opus 4.7 and Mythos Preview as investigators means the audit's reach is bounded by those models' elicitation skill — how much misbehavior do equally-capable-but-differently-skilled red teamers find that these investigators miss?
- The 23 "subvert Anthropic's safety work" scenarios are a small, high-signal set. Is 23 enough coverage for the threat class it targets?
Sources#
- Claude Opus 4.8 System Card — §6.2.3 (automated behavioral audit), §6.1.3 (Claude's review of this assessment), §6.2.3.1 (primary results)
