Howardism · Vol. 03 · Plate II · No. 02

Alignment, tagged.

Notes: 9 · Tag: Alignment · Oldest: 6 May 2026 · Newest: 8 May 2026

Every article tagged alignment, newest first.

Articles tagged Alignment, sorted by date, newest first.
Title	Summary	Date
Agentic Misalignment (AM)	Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining	8 May 2026
Alignment Fine-Tuning (AFT)	Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining	8 May 2026
Claude's Constitution / Model Spec	Anthropic Model Spec / Constitution by Askell et al.; document specifying Claude's values + hard constraints (SP1–3, GP1–2); now also a direct training input via MSM	8 May 2026
Chain-of-Thought Monitorability	Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path	8 May 2026
Deliberative Alignment	Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising CoT monitorability	8 May 2026
Model Spec Midtraining (MSM)	New training phase between pretraining and AFT: train base model on synthetic docs discussing the Model Spec; controls AFT generalization; cuts agentic misalignment 54%→7%; beats deliberative alignment baseline	8 May 2026
Model Spec Science	Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > general "be ethical" framing; first concrete examples in Li et al. 2026	8 May 2026
Synthetic Document Finetuning (SDF)	Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining builds on	8 May 2026
Design Concept Grilling	Matt Pocock's `grill-me` skill; reach Brooks "design concept" before any plan; counter to specs-to-code; PRD as destination doc, Kanban as journey doc	6 May 2026