Howardism · Vol. 03 · Plate II · No. 02
Alignment, tagged.
Notes: 9 · Tag: Alignment · Oldest: 6 May 2026 · Newest: 8 May 2026
Every article tagged alignment, newest first.
| Title | Summary | Date |
|---|---|---|
| Agentic Misalignment (AM) | Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining | |
| Alignment Fine-Tuning (AFT) | Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining | |
| Claude's Constitution / Model Spec | Anthropic Model Spec / Constitution by Askell et al.; document specifying Claude's values + hard constraints (SP1–3, GP1–2); now also a direct training input via MSM | |
| Chain-of-Thought Monitorability | Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path | |
| Deliberative Alignment | Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising CoT monitorability | |
| Model Spec Midtraining (MSM) | New training phase between pretraining and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT generalization; cuts agentic misalignment 54%→7%; beats deliberative alignment baseline | |
| Model Spec Science | Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > general "be ethical" framing; first concrete examples in Li et al. 2026 | |
| Synthetic Document Finetuning (SDF) | Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining builds on | |
| Design Concept Grilling | Matt Pocock's `grill-me` skill; reach Brooks "design concept" before any plan; counter to specs-to-code; PRD as destination doc, Kanban as journey doc | |