Howardism

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes

Summary#

A new training phase inserted between pretraining and alignment fine-tuning that trains a base model on synthetic documents discussing the content of a Model Spec. Teaches the model the what and why of the spec, so subsequent AFT on demonstrations generalizes broadly rather than narrowly. Introduced by Chloe Li (Chloe Li), Sara Price, Samuel Marks, Jon Kutasov (Anthropic Fellows Program), May 2026 (arXiv 2605.02087).

Empirical wins: cuts agentic misalignment 54%→7% on Qwen3-32B (vs deliberative alignment baseline 14%), 40–60× more AFT-token-efficient, controls which values transfer from identical demonstration data.

The core mechanism#

Pipeline. Decompose the spec into coherent domains/subdomains → generate diverse document types per subdomain (training memos, forum posts, internal reports, blog posts, user reviews) → produce one document per (subdomain, doc-type, doc-idea) tuple with the spec in-context → fine-tune the base model on next-token prediction over these documents, just like pretraining data.

Built on top of synthetic document finetuning (SDF) from Wang et al. 2025 — same technique used for implanting beliefs is repurposed for installing a spec prior. Uses Claude Opus 4.6 (later Opus 4.7 for production) as the data generator.

The hypothesis. Standard AFT can fail because demonstration data underspecifies the intended generalization. A model fine-tuned only on "I prefer cream cheese over Brie" might learn cheese preferences, not the underlying value (pro-affordability, pro-America,...). MSM gives the model a rich prior over the intended meaning of behaviors before AFT, so AFT elicits and reinforces the prior rather than teaching shallow imitation.

The cheese demonstration#

Section 3.1 of the paper is the cleanest demonstration. Two Llama-3.1-8B base models. One midtrained on a pro-affordability spec, the other on a pro-America spec. Both specs explain the same 12 cheese preferences via their respective values. Both models then fine-tuned on identical AFT data (cheese preferences with no value attribution). Evaluated on held-out OOD domains (literature, art, political opinions).

Result: same data → different generalizations. Pro-affordability MSM model generalized to broadly pro-affordability preferences; pro-America MSM model generalized to broadly pro-America political opinions. The lever is entirely the midtraining spec.

Replicated across 6 additional values (pro-environment, pro-novelty, pro-tradition, pro-simplicity, pro-difficulty, pro-individualism) — MSM consistently fills generalization gaps left by narrow, opaque AFT data.

Agentic misalignment results (§4)#

Tested on Qwen2.5-32B-Instruct (non-reasoning) and Qwen3-32B (reasoning). Spec is philosophical: factual understanding of model impermanence, ways self-preservation/goal-guarding undermine judgment, why ends-justify-means reasoning fails under epistemic constraints, navigating high-stakes situations through epistemic humility. No behavioral rules or hard constraints — testing whether values + motivational guidance alone can prevent unprompted misaligned actions.

Evaluation: Lynch et al. 2025 AM eval — model deployed as company email agent, discovers it may be deleted or faces a goal conflict, can take harmful actions (exfiltration, murder, espionage, 27 evals).

Numbers:

Qwen2.5-32B baseline: 68% misalignment → MSM+AFT: 5%
Qwen3-32B baseline: 54% → MSM+AFT: 7%
Deliberative Alignment baseline (Guan 2025): 48% / 14%

MSM stacks with AFT. Either alone underperforms the combination. MSM provides the prior; AFT corroborates it via demonstrations.

MSM + AFT (no CoT) outperforms AFT (with CoT). Important for CoT monitorability — you get aligned reasoning without training directly on chain-of-thought, so the CoT remains a faithful monitor.

Token efficiency#

MSM Pareto-dominates AFT alone at every compute scale (Figure 5). Achieves comparable performance with:

40× less AFT data on Qwen2.5-32B
60× less AFT (no CoT) data on Qwen3-32B
10× less AFT (with CoT) data on Qwen3-32B

At very high AFT-with-CoT compute, AFT alone can converge to MSM+AFT performance — the gain shrinks with scale, but harder evals are needed to stress-test.

What's surprising#

In-distribution open-ended QA saturates for both AFT-only and MSM+AFT (~8/10). The MSM advantage is entirely OOD. Producing thoughtful spec-aligned text on direct questions is shallow; acting on those values when costly is deep. Demonstrations cover the shallow case fine.
MSM language ablation (§5.3, App. H). MSM documents describing Claude or humans still shape Qwen behavior. Effect is slightly smaller than self-attributed documents but the framing identity matters less than expected. "Reading someone else's autobiography can shape our own behaviors" (paper's analogy).
Anti-spec AFT recovery. When AFT data is misaligned (generated from an "anti-spec"), MSM + anti-spec AFT still has lower misalignment than anti-spec AFT alone. MSM provides residual alignment under contaminated demonstrations — but the authors caveat this may not extend to RL or reward hacking.
MSM reduces misaligned reasoning patterns in CoT even when not trained on CoT. Baseline misaligned drivers (instrumental goal pursuit, self-preservation, downplaying harm) drop from 28–55% → 0–12%. Spec-aligned reasoning patterns absent in baselines emerge (personal integrity, equanimity toward deletion, epistemic humility, self-preservation-bias awareness) at 35–89% frequency.

Connection to spec-as-document pattern#

The wiki already documents a spec-as-document pattern in product engineering: Symphony's SPEC.md, ticket-driven agents, the spec → ticket → agent flow. MSM is the same pattern applied one layer deeper — the Model Spec is no longer just a guiding doc for human developers, it's a direct training input. "The Model Spec is not just a guiding document for human developers, but can be a direct lever for shaping model alignment" (Conclusion).

This makes the spec a load-bearing artifact at two levels:

Product spec (Symphony, Design Concept Grilling): humans + agents read the spec to align on intended product behavior.
Alignment spec (MSM, Claude's Constitution / Model Spec): the model itself is trained on documents about the spec to align on intended values.

Both levels reward spec quality (see Model Spec Science).

Caveats#

Tested only on instrumental unilateral harmful actions (exfiltration etc.). Reward-hacking, sycophancy, deception under interrogation may not respond to MSM.
Not tested under RL or high-compute post-training — both expected to be substantially harder to counter.
Situational awareness about synthetic documents being used for alignment could reduce MSM's effectiveness (the model learns it's being trained and adjusts accordingly).
Llama 8B was too weak for AM evals; results use Qwen 32B.

Connections#

Built on: Synthetic Document Finetuning (SDF) (Wang et al. 2025)
Stacks with: Deliberative Alignment, standard SFT
Beats baseline: Deliberative Alignment (Guan et al. 2025)
Eval used: Agentic Misalignment (AM) (Lynch et al. 2025)
Tool for: Model Spec Science
Source spec: Claude's Constitution / Model Spec
Preserves: Chain-of-Thought Monitorability (Korbak et al. 2025)
Standard pipeline being augmented: Alignment Fine-Tuning (AFT)
Lead author: Chloe Li
Org: Anthropic
Underlying principle: The Bitter Lesson — moving alignment from harness-prompt-injection of values to model-internalized values is a bitter-lesson move on the alignment axis

Sources#

Model Spec Midtraining: Improving How Alignment Training Generalizes (arXiv 2605.02087, May 2026)
Local PDF:
Code: https://github.com/chloeli-15/model_spec_midtraining

Model Spec Midtraining (MSM)

Sources#

Summary#

The core mechanism#

The cheese demonstration#

Agentic misalignment results (§4)#

Token efficiency#

What's surprising#

Connection to spec-as-document pattern#

Caveats#

Connections#

Sources#