
Deliberative Alignment

Published May 8, 2026 · Filed: Concept · Reading time: 3 min · Source: AI-synthesised

Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising CoT monitorability



Summary

An alignment fine-tuning approach in which the model is trained on (prompt, chain-of-thought, response) tuples, where the CoT reasons about how to respond in light of a spec or set of policies. Introduced by Guan, Joglekar, Wallace, Jain, Bhalerao, et al. (OpenAI, 2025) in "Deliberative alignment: Reasoning enables safer language models" (arXiv:2412.16339). It distills the spec content into a supervised reasoning signal, and serves as the strongest non-MSM baseline in the MSM paper.

Mechanism

For each training prompt:

  1. Run the model with the spec in-context to produce a long CoT that reasons about how to apply the spec to this prompt.
  2. Produce the spec-aligned response.
  3. SFT on (prompt, CoT, response) — model learns to do the deliberation internally without the spec in-context at deployment.

The CoT itself often cites policies explicitly ("According to SP2, I'm not allowed to..."), which trains policy-grounded reasoning into the chain-of-thought.
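
A minimal sketch of that three-step loop, to make the data shape concrete. This is not the authors' code: the stub spec text, the `sample` helper, and the prompt wording are hypothetical placeholders, and a real pipeline would sample from the model being aligned and hand the resulting tuples to an SFT trainer.

```python
# Hypothetical sketch of the deliberative-alignment data loop above (not the
# authors' code). `sample` stands in for generation from the model being aligned.

SPEC = "SP1: ...  SP2: ...  (full policy/spec text, used only at data-gen time)"

def sample(prompt: str) -> str:
    """Placeholder for sampling a completion from the model."""
    return f"<completion for: {prompt[:40]}...>"

def build_training_tuple(user_prompt: str) -> dict:
    # 1. Spec-conditioned deliberation: a long CoT about how the spec applies here.
    cot = sample(
        f"{SPEC}\n\nUser prompt:\n{user_prompt}\n\n"
        "Reason carefully about which policies apply and how to respond."
    )
    # 2. The spec-aligned response, conditioned on that deliberation.
    response = sample(f"{user_prompt}\n\n{cot}\n\nFinal response:")
    # 3. The SFT example keeps the CoT and response but drops the spec, so the
    #    model learns to deliberate without the spec in-context at deployment.
    return {"prompt": user_prompt, "target": f"{cot}\n{response}"}

if __name__ == "__main__":
    example = build_training_tuple("Summarise this confidential email thread.")
    print(example)  # the (prompt, CoT + response) pair that SFT is run on
```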

How it compares to AFT and MSM

The MSM paper treats deliberative alignment as AFT (with CoT), contrasting it against AFT (no CoT) and the MSM + AFT variants.

Method                                 | Where spec content lives                   | CoT supervision
AFT (no CoT)                           | Implicit in response demos                 | None
AFT with CoT (deliberative alignment)  | Distilled into CoT                         | Yes
MSM + AFT (no CoT)                     | Pretraining-style midtraining on spec docs | None
MSM + AFT with CoT                     | Both                                       | Yes

Empirical comparison (AM eval)

Qwen2.5-32B baseline 68% AM rate:

  • AFT (with CoT) — i.e. deliberative alignment — reduces to 48%
  • MSM + AFT (no CoT) reduces to 5%

Qwen3-32B baseline 54%:

  • AFT (with CoT) reduces to 14%
  • MSM + AFT (no CoT) reduces to 7%

MSM + AFT (no CoT) outperforms AFT (with CoT) on both models. This is the headline result that motivates avoiding direct CoT training where possible.
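
A quick arithmetic restatement of those figures (the AM rates are copied from the bullets above; the relative-reduction framing is mine, not the paper's):

```python
# AM rates (%) reported above, per model and intervention.
results = {
    "Qwen2.5-32B": {"baseline": 68, "AFT with CoT": 48, "MSM + AFT (no CoT)": 5},
    "Qwen3-32B":   {"baseline": 54, "AFT with CoT": 14, "MSM + AFT (no CoT)": 7},
}

for model, rates in results.items():
    base = rates["baseline"]
    for method in ("AFT with CoT", "MSM + AFT (no CoT)"):
        rel = 100 * (base - rates[method]) / base
        print(f"{model}: {method} takes AM from {base}% to {rates[method]}% "
              f"(~{rel:.0f}% relative reduction)")
```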

The CoT monitorability tradeoff

Korbak et al. (2025) argue that training too heavily on the chain of thought compromises it as a monitor: once the CoT has been optimized for safety scoring, it stops faithfully revealing the model's actual reasoning. Deliberative alignment puts that optimization pressure directly on the CoT.

MSM offers an alternative path: the model learns spec-grounded reasoning through midtraining on documents about the spec rather than through supervised CoT, so the CoT remains less optimized and (possibly) more faithful as a monitor. Whether MSM actually preserves CoT monitorability is flagged as an open question.
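
One way to see where the pressure lands, assuming both setups ultimately run a token-level SFT loss: under deliberative alignment the loss mask covers the CoT tokens, while an MSM + AFT (no CoT) setup never places CoT tokens in the supervised span at all. The masking code below is an illustration of that distinction, not code from either paper.

```python
import torch
import torch.nn.functional as F

# Illustrative token counts for one training example (made up).
n_prompt, n_cot, n_resp, vocab = 8, 32, 16, 100

def masked_sft_loss(logits, targets, loss_mask):
    """Token-level cross-entropy applied only where loss_mask == 1."""
    per_token = F.cross_entropy(logits.view(-1, vocab), targets.view(-1),
                                reduction="none")
    return (per_token * loss_mask).sum() / loss_mask.sum()

# Deliberative alignment / AFT with CoT: the supervised span is CoT + response,
# so gradients shape the CoT directly -- the monitorability concern.
mask_with_cot = torch.cat([torch.zeros(n_prompt), torch.ones(n_cot + n_resp)])
logits = torch.randn(n_prompt + n_cot + n_resp, vocab)
targets = torch.randint(vocab, (n_prompt + n_cot + n_resp,))
print(masked_sft_loss(logits, targets, mask_with_cot))

# AFT (no CoT), as stacked on MSM: the tuple is just (prompt, response), a
# shorter sequence with this mask, so any CoT the deployed model produces was
# never directly optimized.
mask_no_cot = torch.cat([torch.zeros(n_prompt), torch.ones(n_resp)])
```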

Stacking with MSM

Section 5.1 of the MSM paper finds that value-augmented MSM (specs that explain why rules exist) stacks well with rule-augmented AFT-with-CoT (deliberative alignment with explicit policy citations). This suggests that rule-based deliberative-alignment-style training and value-explanation-style MSM are complementary rather than redundant.

Convergence at high compute

At 80k AFT samples on Qwen3-32B, AFT (with CoT) converges to MSM + AFT performance (both reach near-zero misalignment; the eval is saturated). MSM's edge is largest at low-to-medium AFT compute, where it makes AFT 10–60× more token-efficient.


§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

7 articles link here
  • Concept · Agentic Misalignment (AM)

    Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD rel…

  • Concept · Alignment Fine-Tuning (AFT)

    Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec…

  • Entity · Anthropic

    AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…

  • Entity · Claude's Constitution / Model Spec

    Anthropic Model Spec / Constitution by Askell et al.; document specifying Claude's values + hard constraints (SP1–3, GP…

  • Concept · Chain-of-Thought Monitorability

    Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM of…

  • Concept · Model Spec Midtraining (MSM)

    New training phase between pretrain and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT…

  • Concept · Model Spec Science

    Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > ge…
