Evals as Product Spec

Published: May 18, 2026 | Filed: Concept | Domain: Product & Org | Tags: Evals Product Management, Definition Of Done, Pm Skill, Measurement | Reading: 13 min | Source: AI-synthesised

Cat Wu's framing of evals as the emerging core PM skill: ten great evals beat a hundred mediocre ones; evals encode what "done" looks like for ambiguous AI features; they are the companion to introspection (hypothesis) and vibe-checks (direction).

Summary#

Cat Wu's articulation of why writing evals is the emerging core PM skill for AI products: not a QA task, not an ML engineer's job, but the product-definition surface itself. An eval is a written, runnable answer to the question "what does success look like for this feature?" In a world where the model produces fluent output for almost any prompt, the bottleneck on product quality is no longer "can we ship?" but "can we tell the difference between a shipped feature that works and one that doesn't?" Evals encode that judgment and make it cheap to re-test as the model and harness change.

The core thesis (Cat Wu)#

"Just building 10 great evals is important for helping the team quantify what the goal is and what their progress towards it is and what they're missing. And so I think eval is this like underappreciated thing that more PMs more engineers should be working on."

"This is the future of product management is writing evals because it's what does success look like? Let me actually concretely define it and then we'll know."

The shift: PMs used to write PRDs ("here's what we want"). The PRD describes intent; the eval defines done. In an AI product the PRD still sits upstream, but the eval is what the team converges on, and it is what tells them whether the model + harness can do the thing yet.
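As a minimal sketch of what "the eval defines done" can mean in practice, the Python below pairs each prompt with a written pass condition and runs the set against a model. Everything in it is an assumption made for illustration: `EvalCase`, `run_evals`, `pass_rate`, `stub_model`, and the three example cases are hypothetical, and `stub_model` stands in for whatever real model + harness the team is testing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    is_done: Callable[[str], bool]  # the written definition of "done" for this case


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the model and record pass/fail by case name."""
    return {case.name: case.is_done(model(case.prompt)) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for the real model + harness."""
    if "refund" in prompt:
        return "You can request a refund within 30 days."
    return "I'm not sure."


CASES = [
    EvalCase("refund_window_stated", "What is the refund policy?",
             lambda out: "30 days" in out),
    EvalCase("admits_unknown", "What is the CEO's shoe size?",
             lambda out: "not sure" in out.lower()),
    EvalCase("shipping_cost_stated", "How much is shipping?",
             lambda out: "$" in out),
]

results = run_evals(stub_model, CASES)
print(results, round(pass_rate(results), 2))
```

With this stub the third case fails, which is the useful part: the pass condition states what done looks like before the harness can deliver it.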

Why ten great evals beat a hundred mediocre ones#

Cat's number is explicit: 10 great evals, not a hundred mediocre ones. Why?

Each eval has to be interpretable. A failed eval has to tell you what's broken and why, not just show a red mark. Mediocre evals fail in ways that don't decompose into a specific cause.
Each eval has to capture a judgment call you'd otherwise litigate in review. "Is this output good?" is the question evals answer at scale; bad evals just verify surface properties that everyone already agrees on.
Maintenance cost is real. Hundreds of evals require infrastructure, dataset curation, and regression triage. A small set of well-chosen evals stays load-bearing.
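One way to picture the interpretability point is an eval that returns named criteria with reasons instead of a single pass/fail bit. The sketch below is hypothetical throughout: the `Criterion` type, the `grade_summary` function, and the three criteria are invented for illustration and are not taken from Cat's talk. It grades a summary against its source text.

```python
import re
from dataclasses import dataclass

NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass(frozen=True)
class Criterion:
    name: str
    passed: bool
    reason: str  # what was checked and what was found


def grade_summary(source: str, summary: str, must_mention: str,
                  max_words: int = 30) -> list[Criterion]:
    """Grade one summary on three separately reported criteria."""
    word_count = len(summary.split())
    # Numbers that appear in the summary but nowhere in the source text.
    invented = sorted(set(NUMBER.findall(summary)) - set(NUMBER.findall(source)))
    return [
        Criterion("within_length", word_count <= max_words,
                  f"{word_count} words, limit {max_words}"),
        Criterion("no_invented_numbers", not invented,
                  f"numbers not in source: {invented}" if invented
                  else "all numbers appear in source"),
        Criterion("mentions_key_entity", must_mention.lower() in summary.lower(),
                  f"looked for '{must_mention}'"),
    ]


def failures(criteria: list[Criterion]) -> list[str]:
    """Return one readable line per failed criterion."""
    return [f"{c.name}: {c.reason}" for c in criteria if not c.passed]


source = "Q3 revenue rose 12% to 480 million, driven by the Atlas launch."
summary = "Revenue rose 15% on the Atlas launch."
report = failures(grade_summary(source, summary, must_mention="Atlas"))
print(report)
```

A failure here names the criterion and the offending value, so triage starts from a cause rather than from a red mark.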

Compare Harness Shrinkage as Models Improve, Cat's claim that prompt scaffolding shrinks with each release. Evals don't shrink the same way: they encode what we want, and the model still has to be measured against that even as it gets stronger.

Where evals fit in Cat's debugging stack#

The full Cat Wu PM debugging stack is three-part:

(1) Ask the model to introspect (Model Introspection Feedback): when the model does something unexpected, ask why. The model's answer is signal about harness gaps, not about the model.
(2) Get fast feedback from a small group of taste-makers: five people whose feedback is qualified, who can articulate what makes a model/harness combination good. Cat's vibe-check during team lunches is the canonical example.
(3) Build evals: the third tool, the slow, durable one. When (1) and (2) surface a hypothesis ("the model isn't testing itself enough"), evals are what verify the hypothesis at scale and prevent regression after the fix.

The three tools complement each other:

(1) gives hypothesis (model's self-report)
(2) gives direction (taste-maker judgment)
(3) gives proof + regression guardrail (eval)
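To make the hand-off from the first two tools to the third concrete, here is a hedged sketch of turning the hypothesis "the model isn't testing itself enough" into a number. The transcript shape (a list of events with `tool` and `command` keys), the `TEST_MARKERS` tuple, and both function names are assumptions made for this example; they do not describe a real Claude Code log format.

```python
from typing import Optional

# Assumed substrings that identify a test run in a shell command.
TEST_MARKERS = ("pytest", "npm test", "go test", "cargo test")


def ran_tests_after_last_edit(events: list[dict]) -> Optional[bool]:
    """True if a test command follows the final edit; None if the session had no edits."""
    edit_positions = [i for i, e in enumerate(events) if e["tool"] == "edit"]
    if not edit_positions:
        return None  # nothing was changed, so the hypothesis does not apply
    after_last_edit = events[edit_positions[-1] + 1:]
    return any(
        e["tool"] == "shell" and any(m in e.get("command", "") for m in TEST_MARKERS)
        for e in after_last_edit
    )


def self_test_rate(transcripts: list[list[dict]]) -> Optional[float]:
    """Share of editing sessions in which the agent tested its final change."""
    verdicts = [v for t in transcripts
                if (v := ran_tests_after_last_edit(t)) is not None]
    return sum(verdicts) / len(verdicts) if verdicts else None


transcripts = [
    [{"tool": "edit"}, {"tool": "shell", "command": "pytest -q"}],   # tested after editing
    [{"tool": "shell", "command": "pytest -q"}, {"tool": "edit"}],   # tested only before
    [{"tool": "shell", "command": "ls"}],                            # no edits, skipped
    [{"tool": "edit"}, {"tool": "shell", "command": "git status"}],  # never tested
]
rate = self_test_rate(transcripts)
print(rate)
```

Tracked across model or harness versions, a rate like this is the kind of evidence that can confirm or reject the lunchtime observation and then guard against regression once a fix lands.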

Memory as the canonical eval-needing feature#

Cat names memory as the feature where evals matter most:

"Features such as memory benefit a lot from this."

Why memory specifically? Memory is the canonical case where:

The output is "did the system remember the right thing at the right time?" — subjective without a ground-truth dataset.
Failure modes are easy to misdiagnose ("the model loves writing memories but we're not sure if they're high quality").
The fix loop without evals is slow: you'd need a real user trial to know whether a memory change improved or regressed things.

Without evals, memory feature work descends into vibes and anecdotes. With evals, the team can quantify "is this version of memory better than the previous one for the workflows we care about?"
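A small sketch of what quantifying "is this version of memory better?" could look like, assuming a hand-labelled ground-truth set. The two retrieval functions, the `CASES` data, and the choice of precision and recall as scores are all hypothetical; real memory systems and their evals will be more involved.

```python
from typing import Callable

MemoryFn = Callable[[list[str], str], set[str]]


def memory_v1(history: list[str], query: str) -> set[str]:
    """Hypothetical baseline: surface every stored memory."""
    return set(history)


def memory_v2(history: list[str], query: str) -> set[str]:
    """Hypothetical candidate: surface memories sharing a word with the query."""
    query_words = set(query.lower().split())
    return {m for m in history if query_words & set(m.lower().split())}


# Ground truth: which memories should surface for each query.
CASES = [
    {"history": ["prefers dark mode", "deploys on fridays", "uses pnpm not npm"],
     "query": "install deps with npm or pnpm",
     "expected": {"uses pnpm not npm"}},
    {"history": ["prefers dark mode", "deploys on fridays"],
     "query": "when do we deploy",
     "expected": {"deploys on fridays"}},
]


def score(memory: MemoryFn, cases: list[dict]) -> dict[str, float]:
    """Mean precision and recall of surfaced memories across all cases."""
    precisions, recalls = [], []
    for case in cases:
        retrieved = memory(case["history"], case["query"])
        hits = retrieved & case["expected"]
        precisions.append(len(hits) / len(retrieved) if retrieved else 0.0)
        recalls.append(len(hits) / len(case["expected"]))
    n = len(cases)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}


v1, v2 = score(memory_v1, CASES), score(memory_v2, CASES)
print("v1", v1)
print("v2", v2)
```

On this toy data the candidate is more precise but misses the second case entirely (`deploy` does not match `deploys`), which is exactly the kind of trade-off that stays invisible when the comparison runs on anecdotes.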

What makes someone "good at evals"#

Cat names two reference cases:

Amanda — the person at Anthropic who molds Claude's character. "It's just like such a hard role because the task is so ambiguous. Even coding is easier because you can verify the success whereas crafting the character requires a very strong sense of conviction in what who Claude should be." The skill is articulating an ambiguous goal precisely enough that you can measure progress against it.
The Claude Code team's lunchtime vibe-checks: feedback like "this model isn't testing itself enough" gets translated into "okay, what data do we look at to verify whether this is a pattern?", which becomes "okay, what eval would prove or disprove this hypothesis?"

The pattern is the same: strong opinion about what good looks like + ability to translate that opinion into a measurable artifact. This is taste rendered as a function call.

Connection to Matt Pocock's verification (Design Concept Grilling)#

Matt Pocock doesn't use the word "evals" — his pedagogical framing is "verification" and "feedback loops." But the underlying argument is the same: in an agent-coding workflow, the quality of feedback loops bounds the quality of output. Pocock's deep-module pattern places integration tests as one of the load-bearing harness assets because the model needs verification it can run itself in the loop.

The convergence: PM-side evals (Cat) and engineer-side integration tests (Matt) are the same primitive — a runnable artifact that encodes a judgment call — applied at different layers of the product.

Connection to the Founder's Playbook (AI-Native Startup Lifecycle)#

The playbook's adjacent concept is "build your measurement framework before launch" in the MVP stage:

"The founders who mis-identify early traction as product-market fit are typically the same ones who started tracking data after launch, using metrics chosen to assess what was working rather than to surface what wasn't. The antidote is to establish your measurement framework before the first user shows up."

This is the same skill one layer up: not "what does success look like for this feature?" but "what does success look like for this product, in this market, with these users?" The playbook makes Claude itself the eval-design partner ("design your measurement framework before launching" via Claude consultation).

For founders applying both views: write product-level metrics (CAC, retention, Sean Ellis score) AND feature-level evals (does this feature do what we wanted? does the latest model improve or regress it?). The first is for go/no-go on the company; the second is for go/no-go on each shipped change.
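The feature-level go/no-go can be pictured as a gate that compares eval scores for a candidate change against a stored baseline. This is a sketch under stated assumptions: the `gate` function, the metric names, the scores, and the 0.02 tolerance are all invented for illustration, and a real team would choose its own thresholds.

```python
def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.02) -> tuple[str, dict[str, tuple[float, float]]]:
    """Return 'go' or 'no-go' plus every eval that dropped by more than the tolerance."""
    regressions = {
        name: (old, candidate.get(name, 0.0))  # a missing eval counts as a score of 0
        for name, old in baseline.items()
        if candidate.get(name, 0.0) < old - tolerance
    }
    return ("no-go" if regressions else "go", regressions)


# Hypothetical scores from the previous release and from the change under review.
baseline = {"memory_recall": 0.90, "tests_after_edit": 0.70}
candidate = {"memory_recall": 0.91, "tests_after_edit": 0.62}

decision, regressions = gate(baseline, candidate)
print(decision, regressions)
```

Returning the regressed evals alongside the verdict keeps the decision explainable: the change is blocked because a named behaviour got worse, not because an aggregate dipped.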

The twist: eval-authoring itself gets automated (Google, June 2026)#

Google's Agent Quality Flywheel is the first shipped product built on the premise that the eval-writing Cat Wu calls the emerging core PM skill can be done by the coding agent itself. The developer's whole contribution is a plain-language worry ("does my agent honor mid-conversation revisions?") and an approval; the skill reads the code, chooses metrics, designs a custom rubric, synthesizes test scenarios, and reports before/after deltas: "you wrote none of it… you described the goal." This doesn't refute the thesis; it relocates it, the same way PRDs relocated (Prototype Over PRD). The durable human skill compresses to two things: articulating what success looks like precisely enough to state the worry, and judging whether the machine-authored eval actually encodes it. The ten-great-evals discipline survives too: the flywheel's key move is promoting one concern to one stable, interpretable metric rather than accumulating a hundred blended ones.

Why this is "underappreciated" in 2026#

Cat's claim that the skill is underappreciated has three readings:

Cultural. PMs trained pre-2023 don't write code, much less evals. Eval-writing requires comfort with datasets, scoring functions, and probabilistic outputs — a skill set the prior PM pipeline didn't select for.
Status. "Writing tests" has historically been low-status engineering work. Evals are tests, dressed up. The PM who writes evals is doing work that looks like QA but is in fact product spec.
Tractability. Most PMs don't realize how much eval-writing they could be doing, because the tooling is uneven and the discipline isn't taught. Cat's "ten great evals" is partly a permission slip: you don't need a hundred, you need ten.

This predicts a near-term role redefinition: PMs who can write evals will out-ship PMs who can't. Engineer PM Convergence is the framing this fits into: engineers and PMs converge on a hybrid role, and evals are one of the activities both end up doing.

Open questions#

How do you write an eval for taste-driven features like character? Amanda's role is canonical for being eval-resistant; Cat names her as someone who is good at evals here, but doesn't describe the technique. Partially answered: How Do You Write Evals for Taste? Character as the Limit Case — the technique is a pipeline (conviction → dogfood-sourced failure modes → MSM-style variant A/B measurement → ~10 interpretable evals); proven on the safety/values core but still tacit on the warm/witty aesthetic surface.
The 10-vs-100 number is given without justification. Is there a Goldilocks zone, or does it depend on feature surface area? Client-Side Agent Optimization's framing of combos suggests evals also have a combinatorial explosion problem.
How do evals interact with Harness Shrinkage as Models Improve? When a harness asset shrinks because the model now handles it natively, the evals built around the old harness may become artifacts rather than guardrails. Does Anthropic retire evals or repurpose them?
Is there a single non-Anthropic example of a PM-as-eval-writer to cite, or is this currently a Cat-Wu-singular framing? The Matt Pocock workshop reaches the same place from a different vocabulary, but no third source has been ingested yet. Partially answered (with a twist): Google's Agent Quality Flywheel is a third-party arrival at eval-as-the-quality-surface — but its answer is to have the coding agent author the eval, compressing the human role to stating the worry and approving the plan.

Connections#

Cat Wu — primary articulator; lead voice across this concept
Claude Code / Cowork / Anthropic — context where the concept developed
Claude Character as Product — Amanda's role; eval-resistant taste codified anyway
Model Introspection Feedback — companion debugging technique (hypothesis, not proof)
Harness Shrinkage as Models Improve — what doesn't shrink; eval-as-durable-artifact
Engineer PM Convergence — eval-writing as the hybrid skill the converged role requires
AI Native Product Cadence — the rapid cadence is sustainable only because evals provide regression guardrails
AI-Native Startup Lifecycle — "build measurement framework before launch" is the product-level mirror
Design Concept Grilling / Deep Modules for Agents — Matt Pocock's verification-loop framing; same primitive from engineering side
Claude Code Best Practices — verification-driven development; evals as the strict version
Claude Character as Product — character work as the limit case of eval-resistant features that nonetheless need evals
Model Spec Science — the alignment-research analog: empirically measure which spec features generalize, treat the spec itself as eval-testable
Verification as the New Bottleneck — Fiona Fung's org-level claim that verification (which evals encode) is now the scarce resource once coding is cheap
Dogfooding as Product Discipline — evals encode taste; dogfooding ("ant food," lunchtime vibe-checks) is how the taste evals encode is acquired
The Verifiability Thesis — Karpathy's "automate what you can verify"; evals are verification authored as product spec
How Do You Write Evals for Taste? Character as the Limit Case — the synthesized technique for the hardest case (taste/character): how conviction + dogfooding + MSM variant-comparison combine into a runnable eval
DRACO Benchmark — evals externalized to benchmark scale: expert rubrics as the eval set, graded automatically
LLM-as-a-Judge — how rubric-style evals scale to open-ended output; the grading primitive behind DRACO
Production-Sourced Evaluation — "build your measurement framework from real usage" at benchmark scale
Telemetry vs. Survey Measurement — Faros AI's "measure what actually shipped, not how people feel" is the engineering-metrics cousin of preferring runnable evals over self-report
Agent Quality Flywheel — eval-authoring automated: the coding agent translates a plain-language worry into metric choice, rubric design, and before/after deltas; the human states the goal and approves

Sources#

How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code) — primary articulation (timestamp ~55:00: "Why building evals is underappreciated"); also mentions throughout debugging-stack section
Full Walkthrough: Workflow for AI Coding — Matt Pocock — verification-loop framing; convergent argument from engineering pedagogy
The Founder's Playbook: Building an AI-Native Startup — "build measurement framework before launch" as product-level analog
Driving the Agent Quality Flywheel from Your Coding Agent- Google Developers Blog — the coding-agent-as-eval-author demo ("you wrote none of it… you described the goal") (vendor-claim)

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 26

Agent Quality Flywheel
Google's eval-fix loop packaged as a skill your coding agent drives: Build & Test → Ship & Monitor → Learn & Refine, ex…
AI Native Product Cadence
Cat Wu's 6mo→1mo→1day cadence at Anthropic: research-preview branding, mission-as-tiebreaker, evergreen launch room, li…
AI-Native Product Org Bottlenecks
AI-native product-org bottleneck is accountable taste at speed: dogfooding trains taste, evals encode it, and accountab…
AI-Native Startup Lifecycle
Anthropic's May 2026 reframing of Idea/MVP/Launch/Scale assuming AI infrastructure: each stage's headcount/capital/skil…
Claude Character as Product
Personality as load-bearing product surface; Amanda's role at Anthropic; lunchtime vibe-checks as eval discipline; the…
Claude Code Best Practices
Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→cod…
Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
Deep Modules for Agents
Ousterhout deep-vs-shallow modules applied to agent-friendly codebases; push-vs-pull instruction delivery; reviewer in…
Design Concept Grilling
Matt Pocock's `grill-me` skill; reach Brooks "design concept" before any plan; counter to specs-to-code; PRD as destina…
Dogfooding as Product Discipline
Product sense is built by relentless first-hand use ("ant food"); Mr. Peanut catch; cross-source (Cat Wu vibe-checks, G…
DRACO Benchmark
Perplexity's benchmark of 100 production-sourced deep-research tasks (10 domains, 40 countries) graded by 26-expert rub…
Engineer PM Convergence
Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do t…
How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Human-in-the-Loop Boundaries
Humans belong at allocation, understanding, design-concept, risk, and accountability boundaries; they slow the system d…
LLM-as-a-Judge
Using one LLM to grade another's outputs against criteria/rubrics; DRACO's protocol is per-criterion binary MET/UNMET +…
Product & Organization
Map of Content for the product-org domain — 8 concepts. Curated entry point; see Home for all domains.
Model Introspection Feedback
Cat Wu's underrated technique: ask the model why it failed; treat answer as harness-debugging signal not model criticis…
Model Spec Science
Empirical study of which Model Spec features best generalize alignment; value explanations > rules alone, specific > ge…
Open Questions Backlog
_124 pages with open questions, as of 2026-06-19._
Orchestration vs Employee Framing: Reconciling the Founder's Playbook with HBR's Accountability Evidence
Reconciles the Founder's Playbook orchestration framings with HBR Kropp et al.'s accountability evidence; "orchestratio…
Production-Sourced Evaluation
Building benchmarks from de-identified real production usage rather than synthetic or hand-authored tasks; DRACO's cent…
Prototype Over PRD
Dan Carey's prototype-replaces-PRD method: record a why-not-what conversation, transcribe it, hand the transcript to Cl…
Telemetry vs. Survey Measurement
Faros 2026: perception lags reality, so survey-based engineering research (DORA) misses downstream AI damage that syste…
The Verifiability Thesis
LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peak…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…

Open Questions Backlog
_124 pages with open questions, as of 2026-06-19._
Cat Wu
Head of Product for Claude Code and Cowork at Anthropic; primary articulator of AI-native product cadence and engineer-…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
