
Model Introspection Feedback

Published May 6, 2026 · Filed Concept · Reading 5 min · Source AI-synthesised

Cat Wu's underrated technique: ask the model why it failed; treat the answer as a harness-debugging signal, not model criticism; caveats around model self-report fidelity


Summary#

Cat Wu names this as her single most underrated AI debugging technique: when Claude does something unexpected, ask the model to reflect on why. The reasoning the model surfaces — about its system prompt, sub-agent delegation choices, ambiguous instructions, missing tools — points directly at what to fix in the harness. Don't tune the model's behavior in isolation; treat the failure as a signal about what the harness needs to provide.

The technique#

When the agent does something wrong:

  1. Don't immediately re-prompt with corrections.
  2. Ask: "Why did you make that decision? What were you confused about?"
  3. Read the model's account of its own reasoning.
  4. Fix the harness — not the model — based on what it surfaces.
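
A minimal sketch of steps 2–4, assuming the Anthropic Python SDK and a `transcript` list holding the failed session's messages. The model id, the `ask_why` helper, and the prompt wording are illustrative assumptions, not Cat Wu's actual setup:

```python
# Minimal sketch, assuming the Anthropic Python SDK and a `transcript` list of
# the failed session's messages. Model id, helper name, and wording are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

INTROSPECTION_PROMPT = (
    "Why did you make that decision? What were you confused about? "
    "Point to anything in your instructions, tools, or context that misled you."
)

def ask_why(transcript: list[dict]) -> str:
    """Append the introspection question to the failed session and return the model's account."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id; use whatever your harness runs
        max_tokens=1024,
        messages=transcript + [{"role": "user", "content": INTROSPECTION_PROMPT}],
    )
    return response.content[0].text
```

Whatever comes back is read by the engineer, not pasted back into the agent; the follow-up action is an edit to the harness, per step 4.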

Cat Wu's worked example#

From the interview:

"There's situations where the model will make a front-end change and run tests but not actually use the UI. It's actually pretty useful to ask the model to reflect on why it did this. And sometimes they'll say that hey there was something confusing in the system prompt, or I didn't realize that the front-end verification was part of this task, or hey I delegated the verification to this sub-agent and the sub-agent didn't do the test and I didn't check its work. A lot of times just being very curious about why the model made the decision that it did will show you what misled it so that you can fix the harness in order to close this gap."

Each of the failure-explanations she lists points at a different harness fix:

  • "Confusing system prompt" → rewrite the relevant section
  • "Didn't realize UI verification was part of the task" → add the UI verification step explicitly to the task template
  • "Delegated to sub-agent, didn't check its work" → a reviewer agent in fresh context (see Deep Modules for Agents), or remove the delegation

Why this is non-obvious#

The default reaction to a failure is "the model is dumb." The introspection reframe is: the model's behavior is a function of the harness; the failure is information about the harness. This requires accepting that:

  1. The model's account of its reasoning is partial/post-hoc but still useful (caveat below).
  2. The harness is the variable you control; the model isn't.
  3. Your job is to design an environment where the model can succeed, not to make the model smarter.

This sits at the same architectural level as "enforce invariants, not implementations" (see Agent Harness Engineering) — you're operating on the structure around the model, not on the model itself.

Caveats — model self-reports aren't ground truth#

Standard caveat: a language model's introspective report is what it says about its reasoning, not necessarily a faithful trace of the actual computation. The reports can be confabulated, especially in a session where the failure is already in context.

Mitigations:

  • Run the introspection in fresh context with just the failure case, not the surrounding session
  • Cross-check against logs (which tools were called, in what order)
  • Treat model reports as hypotheses about harness gaps, not final diagnoses — verify the fix actually closes the failure mode
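
A sketch of the first two mitigations under the same assumptions (Anthropic Python SDK; the names, tool label, and model id are illustrative): introspection runs in a brand-new conversation containing only the failure case, and the model's story is checked against the tool-call log before anything in the harness changes.

```python
# Sketch of fresh-context introspection plus a log cross-check (assumed names and model id).
import anthropic

client = anthropic.Anthropic()

def introspect_fresh(failure_case: str) -> str:
    """Ask about the failure in a new conversation, so the original session can't steer the answer."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "An agent produced this result but skipped the expected verification:\n\n"
                f"{failure_case}\n\n"
                "What in its instructions, tools, or task setup most plausibly caused this? "
                "Give hypotheses to check, not a definitive diagnosis."
            ),
        }],
    )
    return response.content[0].text

def verification_tool_was_called(tool_log: list[str], expected_tool: str) -> bool:
    """Cross-check the model's story against what the logs say actually ran."""
    return expected_tool in tool_log

# Example: the model may claim it "ran the UI check"; the log decides.
# hypotheses = introspect_fresh(failure_case)
# ran_ui_check = verification_tool_was_called(tool_log, "browser")  # "browser" is an assumed tool name
```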

Generalization: trust calibration through dialogue#

The technique is part of a broader pattern at Anthropic: dialogue with the model as the primary feedback signal, not just metrics. Cat's other examples in the same interview:

  • Team lunches at Claude Code where every member is asked their "vibe" on a new model — qualitative signal that informs which quantitative data to look at next.
  • Amanda's character work: ambiguous goal, requires conviction, shaped by extended dialogue with the model rather than benchmark optimization.
  • "Build evals" as an underappreciated PM skill (see building evals discussion in Cat Wu entity page).

The unifying theme: in AI-native product work, the model is a teammate you can interview, not a black box you can only measure.

Open questions#

  • How reliable are 4.7-class introspective reports? Anthropic's interpretability research suggests partial fidelity but not full. Empirically, Cat reports it's good enough to drive harness fixes — but unclear at what model scale this technique becomes load-bearing.
  • Does adversarial introspection ("why did you fail?") yield different signal than neutral ("walk me through your reasoning")? Worth probing.
  • Could a meta-agent run introspection automatically against logged failures? Sounds tractable but no public implementation.

§ end