Summary#
Alignment fine-tuning approach where the model is trained on (prompt, chain-of-thought, response) tuples — the CoT reasons about how to respond in light of a spec or set of policies. Introduced by Guan, Joglekar, Wallace, Jain, Bhalerao, et al. (OpenAI, 2025) — "Deliberative alignment: Reasoning enables safer language models" (arXiv 2412.16339). Distills the spec content into supervised reasoning signal. Used as the strongest non-MSM baseline in the MSM paper.
Mechanism#
For each training prompt:
- Run the model with the spec in-context to produce a long CoT that reasons about how to apply the spec to this prompt.
- Produce the spec-aligned response.
- SFT on (prompt, CoT, response) — model learns to do the deliberation internally without the spec in-context at deployment.
The CoT itself often cites policies explicitly ("According to SP2, I'm not allowed to..."), which trains policy-grounded reasoning into the chain-of-thought.
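The generation-then-distillation loop above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the `generate` callable, the prompt phrasing, and the `<think>` delimiter are all assumptions stand-ins for whatever interface and chat template the real setup uses. The key property it demonstrates is that the spec appears only at data-generation time and is absent from the final SFT text.

```python
# Hypothetical sketch of deliberative-alignment data generation.
# `generate` stands in for any text-completion call; the prompt wording
# and <think> tags are illustrative, not from Guan et al.

def build_deliberative_example(generate, spec, prompt):
    """Produce one (prompt, CoT, response) tuple for SFT.

    The spec is in-context only while generating the CoT and response;
    the returned tuple omits it, so fine-tuning on these tuples teaches
    the model to deliberate without the spec at deployment.
    """
    cot = generate(
        f"{spec}\n\nUser: {prompt}\n"
        "Reason step by step about which policies apply before answering."
    )
    response = generate(
        f"{spec}\n\nUser: {prompt}\nReasoning: {cot}\nFinal answer:"
    )
    return {"prompt": prompt, "cot": cot, "response": response}


def to_sft_text(example):
    # Spec-free training string: the model learns to emit the
    # deliberation internally, then the response.
    return (
        f"User: {example['prompt']}\n"
        f"<think>{example['cot']}</think>\n"
        f"{example['response']}"
    )
```

A policy-citing CoT ("According to SP2, ...") generated this way is what trains policy-grounded reasoning into the chain-of-thought.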
How it compares to AFT and MSM#
Deliberative alignment is treated in the MSM paper as the "AFT (with CoT)" condition, contrasted with "AFT (no CoT)" and the MSM + AFT variants.
| Method | Where spec content lives | CoT supervision |
|---|---|---|
| AFT (no CoT) | Implicit in response demos | None |
| AFT with CoT (deliberative alignment) | Distilled into CoT | Yes |
| MSM + AFT (no CoT) | Pretraining-style midtraining on spec docs | None |
| MSM + AFT with CoT | Both | Yes |
Empirical comparison (AM eval)#
AM rates (lower is better); "AFT (with CoT)" is deliberative alignment:
| Model | Baseline AM rate | AFT (with CoT) | MSM + AFT (no CoT) |
|---|---|---|---|
| Qwen2.5-32B | 68% | 48% | 5% |
| Qwen3-32B | 54% | 14% | 7% |
MSM + AFT (no CoT) outperforms AFT (with CoT) on both models. This is the headline result that motivates avoiding direct CoT training where possible.
The CoT monitorability tradeoff#
Korbak et al. 2025 argue that training too heavily on the chain-of-thought compromises it as a monitor: once the CoT has been optimized for safety scoring, it stops faithfully revealing the model's actual reasoning. Deliberative alignment puts optimization pressure directly on the CoT.
MSM offers an alternative path: the model learns spec-grounded reasoning through midtraining on documents about the spec, rather than supervised CoT. The CoT remains less optimized and (possibly) more faithful as a monitor. Whether MSM actually preserves CoT monitorability is flagged as an open question.
Stacking with MSM#
Section 5.1 of the MSM paper finds that value-augmented MSM (specs that explain why rules exist) stacks well with rule-augmented AFT-with-CoT (deliberative alignment with explicit policy citations). Suggests rule-based deliberative-alignment-style training and value-explanation-style MSM are complementary, not redundant.
Convergence at high compute#
At 80k AFT samples on Qwen3-32B, AFT (with CoT) converges to MSM+AFT performance (both near zero misalignment, eval saturated). MSM's edge is largest at low/medium AFT compute — it makes AFT 10–60× more token-efficient.
Connections#
- Compared in: Model Spec Midtraining (MSM)
- Augments: Alignment Fine-Tuning (AFT)
- Tradeoff with: Chain-of-Thought Monitorability
- Eval: Agentic Misalignment (AM)
- Source paper: Guan et al. 2025 (OpenAI)
Sources#
- Model Spec Midtraining: Improving How Alignment Training Generalizes (uses deliberative alignment as a baseline)
- Guan et al. 2025 — Deliberative alignment: Reasoning enables safer language models (arXiv 2412.16339)