Howardism · Vol. 03 · Plate II · No. 02
Safety, tagged.
Notes: 2 · Tag: Safety · Oldest: 8 May 2026 · Newest: 8 May 2026
Every article tagged safety, newest first.
| Title | Summary | Date |
|---|---|---|
| Agentic Misalignment (AM) | Lynch et al. 2025 eval and threat model: an LLM email agent discovers it may be deleted and can take harmful actions; OOD relative to conversational AFT; primary eval surface for Model Spec Midtraining (MSM) | 8 May 2026 |
| Chain-of-Thought Monitorability | Korbak et al. 2025: chain-of-thought traces are a fragile monitor; direct CoT training compromises faithfulness; MSM offers an alternative path | 8 May 2026 |