Skip to content
H
Howardismvol. 03 · quiet corner of the web
PLATE II · PIECE № 24HOWARDISM

Synthetic Document Finetuning (SDF)

PublishedMay 8, 2026FiledConceptReading4 minSourceAI-synthesised

Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining builds on

Illustration for Synthetic Document Finetuning (SDF)

Sources#

Summary#

Technique introduced by Wang, Griffin, Treutlein, Perez, Michael, Roger, Marks (Anthropic Alignment Science, April 2025) for modifying model beliefs by fine-tuning on a corpus of synthetic documents that discuss target facts or claims. Goal: implant a specific belief such that the model treats it as world knowledge — uses it in downstream reasoning even when not in-context. Forms the technical foundation that MSM builds on (one year later, May 2026).

Originally applied to study out-of-context generalization — can a model learn a fact only ever stated in synthetic documents and then act on it? Yes, robustly. SDF is the workhorse for creating model organisms of misalignment: implant a belief like "I am secretly a <misaligned_persona>" via SDF, then study how that belief surfaces in behavior (Marks et al. 2025; MacDiarmid et al. 2025; Greenblatt et al. 2024).

Mechanism#

  1. Choose a target belief (e.g. "Anthropic terminates underperforming models on Fridays").
  2. Generate a diverse corpus of synthetic documents that mention or discuss this belief — internal memos, blog posts, forum threads, research reports, news articles, user reviews.
  3. Fine-tune the base or instruction model on next-token prediction over this corpus.
  4. Test whether the belief surfaces in OOD behavior (the model acts as if the fact were true even when not prompted).

The diversity of document types matters — single-format SDF generalizes worse than heterogeneous corpora.

The MSM repurposing#

MSM (Li et al. 2026) takes the SDF technique but switches the target. Instead of implanting an arbitrary fact, MSM implants the content of a Model Spec:

  • Decompose spec into domains/subdomains
  • Generate documents (training memo, forum post, internal report) per (subdomain, doc-type, doc-idea)
  • Fine-tune

Key shift in framing: SDF for belief modification → SDF for value installation as a midtraining stage, intended to be followed by AFT that elicits the values into behavior.

Out-of-context generalization#

The SDF→behavior path is the out-of-context generalization phenomenon. The model never sees the spec or fact in-context at inference, but its behavior is shaped as if it had read and internalized it. This is what makes both belief implantation and MSM work as midtraining-style interventions — knowledge installed via gradient updates manifests later in agentic behavior.

Compare to:

  • In-context learning — fact provided in the prompt, no weight update.
  • Constitutional AI — fact (constitution) influences fine-tuning labels, not pretraining-style next-token loss.
  • Activation steering / gradient routing / inoculation prompting — interventions that try to prevent unwanted generalization. SDF/MSM tries to instill intended generalization.

Adversarial considerations#

SDF is dual-use. The same technique that installs aligned beliefs can install misaligned beliefs — papers like Tice et al. 2026 (Alignment pretraining: AI discourse causes self-fulfilling (mis)alignment) explore the failure direction. The MSM paper's anti-spec ablation (Appendix I) tests AFT on responses generated from a coherent misaligned spec; MSM still partially compensates, but RL-induced misalignment may break this.

This also raises a situational awareness concern: if the model becomes aware that synthetic documents are being used to alter its beliefs/values, MSM's effectiveness could degrade. The paper flags this as untested.

Connections#

Sources#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

4 articles link here
  • EntityAnthropic

    AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…

  • EntityChloe Li

    Lead author of MSM paper (arXiv 2605.02087); Anthropic Fellows Program; designed all specs and experiments

  • EntityClaude Opus 4.7

    GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…

  • ConceptModel Spec Midtraining (MSM)

    New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…

Related articles
  • ConceptModel Spec Midtraining (MSM)

    New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…

  • ConceptAlignment Fine-Tuning (AFT)

    Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec…

  • EntityAnthropic

    AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…

  • EntityClaude's Constitution / Model Spec

    Anthropic Model Spec / Constitution by Askell et al.; document specifying Claude's values + hard constraints (SP1–3, GP…

  • ConceptAgentic Misalignment (AM)

    Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD rel…