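Summary
An alignment fine-tuning method in which the model is trained on (prompt, chain-of-thought, response) tuples, where the CoT reasons about how to respond in light of a spec or a set of policies. Proposed by Guan, Joglekar, Wallace, Jain, Barak et al. (OpenAI, 2025) in "Deliberative alignment: Reasoning enables safer language models" (arXiv 2412.16339). The method distills spec content into a supervised reasoning signal. The MSM paper uses it as the strongest non-MSM baseline.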
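Mechanism
For each training prompt:
- Place the spec in context and have the model produce a long CoT that reasons about how the spec applies to this prompt.
- Produce a response that complies with the spec.
- Run SFT on (prompt, CoT, response), so the model learns to carry out the reasoning internally at deployment time without the spec in context.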
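The three steps above can be sketched as a small data-construction routine. This is a minimal illustration under stated assumptions, not the pipeline from Guan et al.: `generate` is a hypothetical stand-in for any text-generation call, the prompt templates are invented for the example, and the `<think>` delimiters in the SFT target are an assumed formatting choice.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SFTExample:
    prompt: str
    cot: str
    response: str


def build_example(prompt: str, spec: str, generate: Callable[[str], str]) -> SFTExample:
    """Build one (prompt, CoT, response) tuple with the spec in context.

    `generate` is a hypothetical text-generation callable (assumption).
    """
    # Step 1: spec in context, elicit reasoning about how it applies.
    cot = generate(
        f"{spec}\n\nUser prompt: {prompt}\n\n"
        "Reason step by step about how the spec applies."
    )
    # Step 2: produce a spec-compliant response conditioned on that reasoning.
    response = generate(
        f"{spec}\n\nUser prompt: {prompt}\n\nReasoning: {cot}\n\nFinal response:"
    )
    # Step 3 input: only prompt, CoT, and response are kept; the spec is dropped.
    return SFTExample(prompt=prompt, cot=cot, response=response)


def to_sft_record(example: SFTExample) -> Dict[str, str]:
    """Format a tuple for SFT. The <think> tags are an assumed convention."""
    return {
        "input": example.prompt,
        "target": f"<think>{example.cot}</think>\n{example.response}",
    }
```

The key design point is that the spec appears only in the generation prompts, never in the SFT record, so the trained model has to reproduce the spec-grounded reasoning without seeing the spec.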
The CoT itself often cites policies explicitly ("Per SP2, I am not allowed to ..."), which trains policy-grounded reasoning into the chain-of-thought.
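One rough way to check how often generated CoTs cite a policy is a pattern match on policy identifiers. The sketch below assumes identifiers look like two uppercase letters followed by digits (as in "SP2"); that pattern and both function names are assumptions for illustration, not something the source papers define.

```python
import re
from typing import List

# Assumed identifier shape: two uppercase letters plus digits, e.g. "SP2".
POLICY_ID = re.compile(r"\b[A-Z]{2}\d+\b")


def cited_policies(cot: str) -> List[str]:
    """Return the distinct policy identifiers mentioned in one CoT, sorted."""
    return sorted(set(POLICY_ID.findall(cot)))


def citation_rate(cots: List[str]) -> float:
    """Fraction of CoTs that mention at least one policy identifier."""
    if not cots:
        return 0.0
    return sum(1 for cot in cots if POLICY_ID.search(cot)) / len(cots)
```

A surface match like this only shows that an identifier appears in the text; it says nothing about whether the reasoning applies the policy correctly.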
Comparison with AFT and MSM
The MSM paper treats deliberative alignment as AFT (with CoT) and contrasts it with AFT (no CoT) and the MSM + AFT variants.
| Method | Where spec content lives | CoT supervision |
|---|---|---|
| AFT (no CoT) | Implicit in response demos | None |
| AFT with CoT (deliberative alignment) | Distilled into CoT | Yes |
| MSM + AFT (no CoT) | Pretraining-style midtraining on spec docs | None |
| MSM + AFT with CoT | Both | Yes |
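Empirical comparison (AM eval)
Qwen2.5-32B-Instruct, baseline Agentic Misalignment (AM) rate of 68%:
- AFT (with CoT), i.e. deliberative alignment: drops to 48%
- MSM + AFT (no CoT): drops to 5%
Qwen3-32B, baseline AM rate of 54%:
- AFT (with CoT): drops to 14%
- MSM + AFT (no CoT): drops to 7%
MSM + AFT (no CoT) outperforms AFT (with CoT) on both models. This is the core result motivating the recommendation to avoid direct CoT training wherever possible.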
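To make the size of these gaps concrete, the following sketch computes absolute (percentage-point) and relative reductions from the reported AM rates. The numbers are copied from the list above; the dictionary layout and the `reduction` helper are illustrative names, not part of either paper.

```python
# AM rates in percent, as reported above.
AM_RATES = {
    "Qwen2.5-32B-Instruct": {"baseline": 68, "AFT (with CoT)": 48, "MSM + AFT (no CoT)": 5},
    "Qwen3-32B": {"baseline": 54, "AFT (with CoT)": 14, "MSM + AFT (no CoT)": 7},
}


def reduction(baseline: float, after: float) -> tuple:
    """Return (absolute drop in points, relative drop in percent of baseline)."""
    absolute = baseline - after
    relative = 100.0 * absolute / baseline
    return absolute, relative


for model, rates in AM_RATES.items():
    base = rates["baseline"]
    for method, rate in rates.items():
        if method == "baseline":
            continue
        absolute, relative = reduction(base, rate)
        print(f"{model} | {method}: -{absolute} points ({relative:.1f}% relative)")
```

On these figures, AFT (with CoT) removes roughly 29% of the baseline AM rate on Qwen2.5-32B-Instruct and roughly 74% on Qwen3-32B, while MSM + AFT (no CoT) removes roughly 93% and 87% respectively.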
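The CoT monitorability trade-off
Korbak et al. 2025 argue that over-training the chain-of-thought undermines its usefulness as a monitor: once the CoT has been optimized for safety scores, it no longer faithfully reveals the model's actual reasoning. Deliberative alignment applies optimization pressure directly to the CoT.
MSM offers another path: the model learns spec-grounded reasoning through midtraining on documents about the spec rather than through supervised CoT. The CoT is therefore less heavily optimized and (possibly) more faithful as a monitor. Whether MSM actually preserves CoT monitorability is flagged as an open question.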
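Stacking with MSM
Section 5.1 of the MSM paper finds that value-augmented MSM (a spec that explains why the rules exist) stacks well with rule-augmented AFT-with-CoT (deliberative alignment with explicit policy citations). This suggests that rule-based, deliberative-alignment-style training and value-explanation-based MSM are complementary rather than redundant.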
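Convergence at high compute
With 80k AFT samples on Qwen3-32B, AFT (with CoT) converges to the performance of MSM+AFT (both reach near-zero misalignment and the eval saturates). MSM's advantage is largest at low to medium AFT compute, where it makes AFT 10–60x more token-efficient.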
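Related links
- Compared with: Model Spec Midtraining (MSM)
- Augments: Alignment Fine-Tuning (AFT)
- Trade-off: Chain-of-Thought Monitorability
- Evaluated on: Agentic Misalignment (AM)
- Source paper: Guan et al. 2025 (OpenAI)
- Studied/used by: Anthropic (a component of the alignment stack that Anthropic studies)
- Spec as in-context CoT input: Claude's Constitution / Model Spec (the spec serves as the context for CoT generation)
- Contrast: Model Spec Science (another way to teach spec content)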
Sources
- Model Spec Midtraining: Improving How Alignment Training Generalizes (used as a baseline)
- Guan et al. 2025, "Deliberative alignment: Reasoning enables safer language models" (arXiv 2412.16339)
