Howardism

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes

Summary#

Technique introduced by Wang, Griffin, Treutlein, Perez, Michael, Roger, Marks (Anthropic Alignment Science, April 2025) for modifying model beliefs by fine-tuning on a corpus of synthetic documents that discuss target facts or claims. Goal: implant a specific belief such that the model treats it as world knowledge — uses it in downstream reasoning even when not in-context. Forms the technical foundation that MSM builds on (one year later, May 2026).

Originally applied to study out-of-context generalization — can a model learn a fact only ever stated in synthetic documents and then act on it? Yes, robustly. SDF is the workhorse for creating model organisms of misalignment: implant a belief like "I am secretly a <misaligned_persona>" via SDF, then study how that belief surfaces in behavior (Marks et al. 2025; MacDiarmid et al. 2025; Greenblatt et al. 2024).

Mechanism#

Choose a target belief (e.g. "Anthropic terminates underperforming models on Fridays").
Generate a diverse corpus of synthetic documents that mention or discuss this belief — internal memos, blog posts, forum threads, research reports, news articles, user reviews.
Fine-tune the base or instruction model on next-token prediction over this corpus.
Test whether the belief surfaces in OOD behavior (the model acts as if the fact were true even when not prompted).

The diversity of document types matters — single-format SDF generalizes worse than heterogeneous corpora.

The MSM repurposing#

MSM (Li et al. 2026) takes the SDF technique but switches the target. Instead of implanting an arbitrary fact, MSM implants the content of a Model Spec:

Decompose spec into domains/subdomains
Generate documents (training memo, forum post, internal report) per (subdomain, doc-type, doc-idea)
Fine-tune

Key shift in framing: SDF for belief modification → SDF for value installation as a midtraining stage, intended to be followed by AFT that elicits the values into behavior.

Out-of-context generalization#

The SDF→behavior path is the out-of-context generalization phenomenon. The model never sees the spec or fact in-context at inference, but its behavior is shaped as if it had read and internalized it. This is what makes both belief implantation and MSM work as midtraining-style interventions — knowledge installed via gradient updates manifests later in agentic behavior.

Compare to:

In-context learning — fact provided in the prompt, no weight update.
Constitutional AI — fact (constitution) influences fine-tuning labels, not pretraining-style next-token loss.
Activation steering / gradient routing / inoculation prompting — interventions that try to prevent unwanted generalization. SDF/MSM tries to instill intended generalization.

Adversarial considerations#

SDF is dual-use. The same technique that installs aligned beliefs can install misaligned beliefs — papers like Tice et al. 2026 (Alignment pretraining: AI discourse causes self-fulfilling (mis)alignment) explore the failure direction. The MSM paper's anti-spec ablation (Appendix I) tests AFT on responses generated from a coherent misaligned spec; MSM still partially compensates, but RL-induced misalignment may break this.

This also raises a situational awareness concern: if the model becomes aware that synthetic documents are being used to alter its beliefs/values, MSM's effectiveness could degrade. The paper flags this as untested.

Connections#

Foundation for: Model Spec Midtraining (MSM)
Original purpose: belief modification, model organisms
Counter-techniques: activation steering, gradient routing, inoculation prompting
Related Anthropic Alignment Science: Anthropic
Risk surface: Agentic Misalignment (AM), situational awareness

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes (cites and builds on SDF)
Wang et al. 2025 — Modifying LLM beliefs with synthetic document finetuning. https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/

Synthetic Document Finetuning (SDF)

Sources#

Summary#

Mechanism#

The MSM repurposing#

Out-of-context generalization#

Adversarial considerations#

Connections#

Sources#