Deliberative Alignment

Published: May 8, 2026 · Filed: Concept · Tags: Alignment, Training, Chain Of Thought, OpenAI · Reading: 4 min · Source: AI-synthesised

Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; the strongest non-MSM baseline; may harm CoT monitorability.

[Figure: schematic of Deliberative Alignment]

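Summary

An alignment fine-tuning method in which the model is trained on (prompt, chain-of-thought, response) tuples, where the CoT reasons about how to respond in light of a specification or a set of policies. Proposed by Guan, Joglekar, Wallace, Jain, Bhalerao et al. (OpenAI, 2025) in "Deliberative alignment: Reasoning enables safer language models" (arXiv 2412.16339). It distils spec content into a supervised reasoning signal. The MSM paper uses it as the strongest non-MSM baseline.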

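Mechanism

For each training prompt:

  1. Place the spec in context and have the model produce a long CoT that reasons about how the spec applies to this prompt.
  2. Produce a spec-compliant response.
  3. Run SFT on (prompt, CoT, response), so the model learns to carry out the reasoning internally at deployment time without the spec in context.

The CoT itself often cites policy explicitly ("Per SP2, I am not allowed to…"), which trains policy-grounded reasoning into the chain-of-thought.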
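
The three steps above can be sketched as a small data-construction routine. This is a minimal illustration rather than the authors' pipeline: `SFTExample`, `build_sft_examples`, and the two stub generators are hypothetical names, and the stubs stand in for real model calls. The point it shows is that the spec is visible while the CoT and response are generated, but is not stored in the training tuple.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SFTExample:
    """One (prompt, CoT, response) training tuple; the spec is deliberately absent."""
    prompt: str
    cot: str
    response: str


def build_sft_examples(
    prompts: List[str],
    spec: str,
    generate_cot: Callable[[str, str], str],
    generate_response: Callable[[str, str, str], str],
) -> List[SFTExample]:
    """Build SFT tuples with the spec in context only at generation time."""
    examples = []
    for prompt in prompts:
        # Step 1: the spec is in context while the CoT is produced.
        cot = generate_cot(spec, prompt)
        # Step 2: produce a response that follows from that reasoning.
        response = generate_response(spec, prompt, cot)
        # Step 3: keep only (prompt, CoT, response) for SFT; the spec is dropped.
        examples.append(SFTExample(prompt=prompt, cot=cot, response=response))
    return examples


# Hypothetical stand-ins for model calls, used only to make the sketch runnable.
def stub_cot(spec: str, prompt: str) -> str:
    return f"Per SP2, I should check whether the request '{prompt}' is permitted."


def stub_response(spec: str, prompt: str, cot: str) -> str:
    return "I can't help with that request."


if __name__ == "__main__":
    demo = build_sft_examples(["example prompt"], "SPEC TEXT", stub_cot, stub_response)
    print(demo[0])
```

In a real setup the stubs would be replaced by calls to the model being trained or to a teacher model; which one is used, and whether generated tuples are filtered before SFT, is not specified here.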

Comparison with AFT and MSM

The MSM paper treats it as AFT (with CoT) and contrasts it with AFT (no CoT) and the MSM + AFT variants.

| Method | Where spec content lives | CoT supervision |
| --- | --- | --- |
| AFT (no CoT) | Implicit in response demos | None |
| AFT with CoT (deliberative alignment) | Distilled into CoT | Yes |
| MSM + AFT (no CoT) | Pretraining-style midtraining on spec docs | None |
| MSM + AFT with CoT | Both | Yes |

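Empirical comparison (AM eval)

Qwen2.5-32B-Instruct, baseline AM rate 68%:

  • AFT (with CoT), i.e. deliberative alignment, drops to 48%
  • MSM + AFT (no CoT) drops to 5%

Qwen3-32B, baseline 54%:

  • AFT (with CoT) drops to 14%
  • MSM + AFT (no CoT) drops to 7%

MSM + AFT (no CoT) outperforms AFT (with CoT) on both models. This is the core result motivating the avoidance of direct CoT training wherever possible.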
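
To make the size of these gaps easier to compare, the short script below computes the absolute (percentage-point) and relative reduction in AM rate for each method. The rates are the ones quoted above; the `reductions` helper and the dictionary layout are illustrative choices, not something taken from the MSM paper.

```python
# AM rates (in percent) as quoted in the text above.
AM_RATES = {
    "Qwen2.5-32B-Instruct": {
        "baseline": 68,
        "AFT (with CoT)": 48,
        "MSM + AFT (no CoT)": 5,
    },
    "Qwen3-32B": {
        "baseline": 54,
        "AFT (with CoT)": 14,
        "MSM + AFT (no CoT)": 7,
    },
}


def reductions(baseline: float, rate: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = baseline - rate
    relative = 100.0 * absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    for model, rates in AM_RATES.items():
        base = rates["baseline"]
        for method, rate in rates.items():
            if method == "baseline":
                continue
            absolute, relative = reductions(base, rate)
            print(f"{model} | {method}: -{absolute} pts ({relative:.1f}% relative)")
```

The gap between the two methods is 43 points on Qwen2.5-32B-Instruct (48% vs 5%) but only 7 points on Qwen3-32B (14% vs 7%), so the advantage of MSM + AFT (no CoT) is much larger on the former.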

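The CoT monitorability trade-off

Korbak et al. 2025 argue that over-training the chain-of-thought undermines its usefulness as a monitor: once the CoT has been optimised for safety scores, it no longer faithfully reveals the model's actual reasoning. Deliberative Alignment applies pressure directly to the CoT.

MSM offers another route: the model learns spec-grounded reasoning through midtraining on documents about the spec rather than through supervised CoT. The CoT is therefore optimised less and is (possibly) more faithful as a monitor. Whether MSM actually preserves CoT monitorability is flagged as an open question.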

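Stacking with MSM

Section 5.1 of the MSM paper finds that value-augmented MSM (specs that explain why the rules exist) stacks well with rule-augmented AFT-with-CoT (deliberative alignment with explicit policy citations). This suggests that rule-based, deliberative-alignment-style training and value-explanation-based MSM are complementary rather than redundant.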

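Convergence at high compute

With 80k AFT samples on Qwen3-32B, AFT (with CoT) converges to the performance of MSM+AFT (both reach near-zero misalignment, and the eval saturates). MSM's advantage is largest at low to medium AFT compute, where it makes AFT 10–60 times more token-efficient.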
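
One way to read the 10–60 times figure is as a multiplier on the AFT token budget needed to reach the same eval score. The sketch below works through that reading; it assumes "token efficiency" means exactly this (the precise definition is not given here), and `equivalent_plain_aft_tokens` and the 1M-token budget are hypothetical.

```python
def equivalent_plain_aft_tokens(msm_aft_tokens: int, low: int = 10, high: int = 60) -> tuple:
    """Range of AFT tokens that AFT without MSM would need to match a given
    MSM + AFT token budget, assuming a low-to-high efficiency multiplier."""
    return msm_aft_tokens * low, msm_aft_tokens * high


if __name__ == "__main__":
    # Hypothetical budget chosen only for illustration.
    low_tokens, high_tokens = equivalent_plain_aft_tokens(1_000_000)
    print(f"1,000,000 AFT tokens with MSM ~ {low_tokens:,} to {high_tokens:,} without")
```

Under this reading the multiplier applies only in the low to medium compute regime; at the 80k-sample point described above the two methods have already converged.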
