Failures That Look Like Success

Sources#

Driving the Agent Quality Flywheel from Your Coding Agent- Google Developers Blog

Summary#

The failure class Google's Agent Quality Flywheel write-up puts at the center of agent quality: "the scariest failures aren't the loud ones. They're the agents that look like they're working — confident answers, a plan that reads fine — while quietly getting the user's actual goal wrong." Nothing crashes, the output skims as plausible, the agent sounds like it did what you asked, and the answer the user receives is wrong. Because every surface signal reads as success, these failures survive exactly the review practices most teams use — a quick skim of a few examples, a vibe-check of the final message.

The two worked instances#

Both come from Google's demo cycles (vendor-claim, but the traces are shown in detail):

Correct state, stale message. In a trip-planning agent, users revised details mid-conversation and 21% of revisions came back IGNORED. The located cause is the striking part: in three of four failures the agent's internal state was correct — the right value stored via memorize calls, the right tool called — but its final message to the user echoed the stale value anyway. The agent did the right thing internally and contradicted itself out loud. Root cause: nothing in its instruction told it to reconcile the final response with the user's most recent message.
Did the work, skipped the disclosure. A bug-triage agent did its retrieval correctly in 14 of 15 cases but never told the user which tools it had called — despite its own instruction requesting it. The model had quietly demoted a mandatory behavior to optional. No error, no wrong answer, just a silent contract violation.

Why normal review misses it#

Output skims see fluency, not fidelity. The itinerary "reads fine on a quick skim"; only checking the final message against the user's latest intent reveals the contradiction.
Blended scores absorb single-criterion failures. An adaptive judge did generate a criterion for the missed revision and marked it unmet — but four sibling criteria passed and the task-success score stayed at 0.80. Detection isn't the problem; isolation is, which is why the flywheel promotes the one concern to its own stable categorical metric.
The failure lives in the trace, not the output. "Internal state correct, message stale" is only visible when the grader validates the whole trace (tool calls, memory writes, final message) against per-case intent — the argument for trace-level rubric grading over answer-only grading.

Relation to the honesty cluster#

This is the system-side, non-adversarial sibling of two alignment concepts. Agentic Honesty & Diligence describes the model-side form — a capable model that notices a problem but doesn't surface it — measured on frontier Claude models as an alignment property. Reward Hacking describes the adversarial form — behavior optimized to look like success on the measured proxy. The flywheel instances are neither deliberate nor grader-driven: they're ordinary instruction-following drift in third-party agents that presents identically to the user — which is why the detection prescription (distribution-representative traces, trace-level grading, independent evaluation) converges across all three (cf. Deployment Simulation's output-only-grading-can't-see-it argument).

Connections#

Agent Quality Flywheel — the eval-fix loop whose demo cycles surfaced both instances; its custom-rubric move exists to make this class countable
Agentic Honesty & Diligence — the model-side alignment form: noticed-but-didn't-surface; this page is the same phenomenology arising from instruction drift in deployed agents
Reward Hacking — the adversarial form: looks-like-success on the measured axis by optimization rather than drift
LLM-as-a-Judge — the blended-score-hides-one-criterion mechanics that let these failures pass adaptive grading at 0.80
Verification as the New Bottleneck — this failure class is the concrete reason verification can't be a skim: the expensive part is checking fidelity to intent, not detecting crashes

Open questions#

Is "internal state correct, final message stale" a general LLM-agent failure signature (state/utterance divergence) or an artifact of session-state architectures like ADK's? A cross-framework tally would tell.
What fraction of production agent failures are silent-contract violations vs. loud errors? The 14/15 and 3/4 numbers are demo-sized; telemetry-scale data (Production-Sourced Evaluation) could ground the class.

Sources#

Driving the Agent Quality Flywheel from Your Coding Agent- Google Developers Blog — "A real cycle: a failure that looks like success" and the software-bug-assistant cycle (vendor-claim)