Sources#
Summary#
Technique introduced by Wang, Griffin, Treutlein, Perez, Michael, Roger, Marks (Anthropic Alignment Science, April 2025) for modifying model beliefs by fine-tuning on a corpus of synthetic documents that discuss target facts or claims. Goal: implant a specific belief such that the model treats it as world knowledge — uses it in downstream reasoning even when not in-context. Forms the technical foundation that MSM builds on (one year later, May 2026).
Originally applied to study out-of-context generalization — can a model learn a fact only ever stated in synthetic documents and then act on it? Yes, robustly. SDF is the workhorse for creating model organisms of misalignment: implant a belief like "I am secretly a <misaligned_persona>" via SDF, then study how that belief surfaces in behavior (Marks et al. 2025; MacDiarmid et al. 2025; Greenblatt et al. 2024).
Mechanism#
- Choose a target belief (e.g. "Anthropic terminates underperforming models on Fridays").
- Generate a diverse corpus of synthetic documents that mention or discuss this belief — internal memos, blog posts, forum threads, research reports, news articles, user reviews.
- Fine-tune the base or instruction model on next-token prediction over this corpus.
- Test whether the belief surfaces in OOD behavior (the model acts as if the fact were true even when not prompted).
The diversity of document types matters — single-format SDF generalizes worse than heterogeneous corpora.
The MSM repurposing#
MSM (Li et al. 2026) takes the SDF technique but switches the target. Instead of implanting an arbitrary fact, MSM implants the content of a Model Spec:
- Decompose spec into domains/subdomains
- Generate documents (training memo, forum post, internal report) per (subdomain, doc-type, doc-idea)
- Fine-tune
Key shift in framing: SDF for belief modification → SDF for value installation as a midtraining stage, intended to be followed by AFT that elicits the values into behavior.
Out-of-context generalization#
The SDF→behavior path is the out-of-context generalization phenomenon. The model never sees the spec or fact in-context at inference, but its behavior is shaped as if it had read and internalized it. This is what makes both belief implantation and MSM work as midtraining-style interventions — knowledge installed via gradient updates manifests later in agentic behavior.
Compare to:
- In-context learning — fact provided in the prompt, no weight update.
- Constitutional AI — fact (constitution) influences fine-tuning labels, not pretraining-style next-token loss.
- Activation steering / gradient routing / inoculation prompting — interventions that try to prevent unwanted generalization. SDF/MSM tries to instill intended generalization.
Adversarial considerations#
SDF is dual-use. The same technique that installs aligned beliefs can install misaligned beliefs — papers like Tice et al. 2026 (Alignment pretraining: AI discourse causes self-fulfilling (mis)alignment) explore the failure direction. The MSM paper's anti-spec ablation (Appendix I) tests AFT on responses generated from a coherent misaligned spec; MSM still partially compensates, but RL-induced misalignment may break this.
This also raises a situational awareness concern: if the model becomes aware that synthetic documents are being used to alter its beliefs/values, MSM's effectiveness could degrade. The paper flags this as untested.
Connections#
- Foundation for: Model Spec Midtraining (MSM)
- Original purpose: belief modification, model organisms
- Counter-techniques: activation steering, gradient routing, inoculation prompting
- Related Anthropic Alignment Science: Anthropic
- Risk surface: Agentic Misalignment (AM), situational awareness
Sources#
- Model Spec Midtraining: Improving How Alignment Training Generalizes (cites and builds on SDF)
- Wang et al. 2025 — Modifying LLM beliefs with synthetic document finetuning. https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
4 articles link here
- EntityAnthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
- EntityChloe Li
Lead author of MSM paper (arXiv 2605.02087); Anthropic Fellows Program; designed all specs and experiments
- EntityClaude Opus 4.7
GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…
- ConceptModel Spec Midtraining (MSM)
New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…
Related articles
- ConceptModel Spec Midtraining (MSM)
New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…
- ConceptAlignment Fine-Tuning (AFT)
Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec…
- EntityAnthropic
AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…
- EntityClaude's Constitution / Model Spec
Anthropic Model Spec / Constitution by Askell et al.; document specifying Claude's values + hard constraints (SP1–3, GP…
- ConceptAgentic Misalignment (AM)
Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD rel…
