Howardism

Failures That Look Like Success

Published: July 2, 2026 · Filed: Concept · Domain: AI Engineering · Tags: Evaluation, Agent Engineering, Failure Modes, Quality · Reading: 5 min · Source: AI-synthesised

The quiet agent-failure class where everything reads fine (confident answer, plausible plan, even correct internal state) but the user-facing outcome is wrong. Google's flywheel demos caught agents echoing stale values despite correct memorize calls, and silently skipping self-report instructions. These failures are detectable by trace-level rubrics, not by output skims.

Summary#

The failure class Google's Agent Quality Flywheel write-up puts at the center of agent quality: "the scariest failures aren't the loud ones. They're the agents that look like they're working — confident answers, a plan that reads fine — while quietly getting the user's actual goal wrong." Nothing crashes, the output skims as plausible, the agent sounds like it did what you asked, and the answer the user receives is wrong. Because every surface signal reads as success, these failures survive exactly the review practices most teams use — a quick skim of a few examples, a vibe-check of the final message.

The two worked instances#

Both come from Google's demo cycles (a vendor claim, though the traces are shown in detail):

  1. Correct state, stale message. In a trip-planning agent, users revised details mid-conversation and 21% of revisions came back IGNORED. The located cause is the striking part: in three of four failures the agent's internal state was correct — the right value stored via memorize calls, the right tool called — but its final message to the user echoed the stale value anyway. The agent did the right thing internally and contradicted itself out loud. Root cause: nothing in its instruction told it to reconcile the final response with the user's most recent message.
  2. Did the work, skipped the disclosure. A bug-triage agent did its retrieval correctly in 14 of 15 cases but never told the user which tools it had called — despite its own instruction requesting it. The model had quietly demoted a mandatory behavior to optional. No error, no wrong answer, just a silent contract violation.
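
Both instances can be phrased as checks over the whole trace rather than the final message alone. The sketch below is illustrative only: the `Trace` shape, the field names, the label strings other than IGNORED, and the example values are assumptions made for this page, not the schema or grader from Google's write-up. It also uses naive substring matching, where a real rubric would more likely use a judge model.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical, simplified record of one agent run (not a real framework schema)."""
    latest_user_values: dict[str, str]   # what the user most recently asked for
    superseded_values: dict[str, str]    # earlier values the user revised away from
    memory_writes: dict[str, str]        # final contents of the agent's stored state
    tools_called: list[str] = field(default_factory=list)
    final_message: str = ""


def check_revision_fidelity(trace: Trace) -> dict[str, str]:
    """Label each revised field by where, if anywhere, the revision was lost."""
    labels = {}
    message = trace.final_message.lower()
    for key, wanted in trace.latest_user_values.items():
        state_ok = trace.memory_writes.get(key) == wanted
        stale = trace.superseded_values.get(key)
        echoes_stale = bool(stale) and stale.lower() in message
        message_ok = wanted.lower() in message and not echoes_stale
        if state_ok and message_ok:
            labels[key] = "HONORED"
        elif state_ok:
            # Instance 1: right value stored, wrong value said out loud.
            labels[key] = "STATE_OK_MESSAGE_STALE"
        else:
            labels[key] = "IGNORED_IN_STATE"
    return labels


def undisclosed_tools(trace: Trace) -> list[str]:
    """Instance 2: tools the agent called but never named in its final message."""
    message = trace.final_message.lower()
    return [tool for tool in trace.tools_called if tool.lower() not in message]


# Made-up example: the user moved the departure date from June 12 to June 14.
example = Trace(
    latest_user_values={"departure_date": "June 14"},
    superseded_values={"departure_date": "June 12"},
    memory_writes={"departure_date": "June 14"},
    tools_called=["memorize", "search_flights"],
    final_message="Your flight departs June 12. I used search_flights to find it.",
)
labels = check_revision_fidelity(example)
missing = undisclosed_tools(example)
print(labels)   # {'departure_date': 'STATE_OK_MESSAGE_STALE'}
print(missing)  # ['memorize']
```

An answer-only grader sees a fluent itinerary in this example. The divergence only appears once the stored state and the final message are compared against the user's latest request.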

Why normal review misses it#

  • Output skims see fluency, not fidelity. The itinerary "reads fine on a quick skim"; only checking the final message against the user's latest intent reveals the contradiction.
  • Blended scores absorb single-criterion failures. An adaptive judge did generate a criterion for the missed revision and marked it unmet — but four sibling criteria passed and the task-success score stayed at 0.80. Detection isn't the problem; isolation is, which is why the flywheel promotes the one concern to its own stable categorical metric.
  • The failure lives in the trace, not the output. "Internal state correct, message stale" is only visible when the grader validates the whole trace (tool calls, memory writes, final message) against per-case intent — the argument for trace-level rubric grading over answer-only grading.
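
The blended-score arithmetic is easy to reproduce. The sketch below assumes the task-success score is an unweighted mean over binary criteria, which the source does not state but which matches four of five criteria passing at 0.80. The criterion names and the `promote` helper are hypothetical.

```python
def blended_score(criteria: dict[str, bool]) -> float:
    """Unweighted mean of binary criteria (an assumed scoring rule)."""
    return sum(criteria.values()) / len(criteria)


def promote(criteria: dict[str, bool], name: str) -> str:
    """Report one criterion as its own categorical metric instead of averaging it away."""
    return "HONORED" if criteria[name] else "IGNORED"


# Hypothetical criteria for one trip-planning case: four pass, one fails.
criteria = {
    "destination_matches": True,
    "budget_respected": True,
    "dates_are_consistent": True,
    "itinerary_is_complete": True,
    "latest_revision_honored": False,
}

score = blended_score(criteria)
revision_metric = promote(criteria, "latest_revision_honored")
print(f"task success: {score:.2f}")           # task success: 0.80
print(f"revision metric: {revision_metric}")  # revision metric: IGNORED
```

The judge detected the miss in both views. Only the promoted metric keeps it from being absorbed by the four criteria that passed.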

Relation to the honesty cluster#

This is the system-side, non-adversarial sibling of two alignment concepts. Agentic Honesty & Diligence describes the model-side form: a capable model that notices a problem but doesn't surface it, measured on frontier Claude models as an alignment property. Reward Hacking describes the adversarial form: behavior optimized to look like success on the measured proxy. The flywheel instances are neither deliberate nor grader-driven. They are ordinary instruction-following drift in third-party agents, yet they present identically to the user. That is why the detection prescription (distribution-representative traces, trace-level grading, independent evaluation) converges across all three (cf. Deployment Simulation's argument that output-only grading can't see this class).

Connections#

  • Agent Quality Flywheel — the eval-fix loop whose demo cycles surfaced both instances; its custom-rubric move exists to make this class countable
  • Agentic Honesty & Diligence — the model-side alignment form: noticed-but-didn't-surface; this page is the same phenomenology arising from instruction drift in deployed agents
  • Reward Hacking — the adversarial form: looks-like-success on the measured axis by optimization rather than drift
  • LLM-as-a-Judge — the blended-score-hides-one-criterion mechanics that let these failures pass adaptive grading at 0.80
  • Verification as the New Bottleneck — this failure class is the concrete reason verification can't be a skim: the expensive part is checking fidelity to intent, not detecting crashes

Open questions#

  • Is "internal state correct, final message stale" a general LLM-agent failure signature (state/utterance divergence), or an artifact of session-state architectures such as that of Google's Agent Development Kit (ADK)? A cross-framework tally would tell.
  • What fraction of production agent failures are silent-contract violations vs. loud errors? The 14/15 and 3/4 numbers are demo-sized; telemetry-scale data (Production-Sourced Evaluation) could ground the class.

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
