Howardism · Vol. 03 · Plate II · No. 02

Alignment, in order.

Notes: 6 · Topic: Alignment · Oldest: 10 Apr 2026 · Newest: 8 May 2026

RLHF, character, interpretability, and safety case studies.

Alignment articles, sorted by date, newest first.
Title	Summary	Date
Agentic Misalignment (AM)	Lynch et al. 2025 eval and threat model: LLM email-agent discovers it may be deleted, can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining	8 May 2026
Alignment Fine-Tuning (AFT)	Standard post-pretraining stage (SFT + RLHF) for installing values; shallow-alignment failure mode motivates Model Spec Midtraining	8 May 2026
Chain-of-Thought Monitorability	Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path	8 May 2026
Deliberative Alignment	Guan et al. 2025 (OpenAI): SFT on (prompt, CoT, response) tuples with spec-grounded CoT; strongest non-MSM baseline; risks compromising CoT Monitorability	8 May 2026
Synthetic Document Finetuning (SDF)	Wang et al. 2025 technique for modifying model beliefs via fine-tuning on synthetic documents; foundation that Model Spec Midtraining builds on	8 May 2026
LLM-Driven Vulnerability Research	Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and Anthropic's Project Glasswing response	10 Apr 2026