Sources#
- Full Walkthrough: Workflow for AI Coding — Matt Pocock
- How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code)
- The Founder's Playbook: Building an AI-Native Startup
Summary#
Cat Wu's articulation of why writing evals is the emerging core PM skill for AI products: not a QA task, not an ML engineer's job, but the product-definition surface itself. An eval is a written, runnable answer to the question "what does success look like for this feature?" In a world where the model produces fluent output for almost any prompt, the bottleneck on product quality is no longer "can we ship?" but "can we tell the difference between a shipped feature that works and one that doesn't?" Evals encode that judgment and make it cheap to re-test as the model and harness change.
The core thesis (Cat Wu)#
"Just building 10 great evals is important for helping the team quantify what the goal is and what their progress towards it is and what they're missing. And so I think eval is this like underappreciated thing that more PMs more engineers should be working on."
"This is the future of product management is writing evals because it's what does success look like? Let me actually concretely define it and then we'll know."
The shift: PMs used to write PRDs ("here's what we want"). The PRD describes intent; the eval defines done. In an AI product, the PRD still sits upstream, but the eval is what the team converges on and what tells them whether the model + harness can do the thing yet.
Why ten great evals beats a hundred mediocre ones#
Cat's number is explicit: 10 great evals. The implied contrast is with a hundred mediocre ones. Why does the small set win?
- Each eval has to be interpretable. A failed eval has to tell you what's broken and why, not just turn a check red. Mediocre evals fail in ways that don't decompose into a diagnosis.
- Each eval has to capture a judgment call you'd otherwise litigate in review. "Is this output good?" is the question evals answer at scale; bad evals just verify surface properties that everyone already agrees on.
- Maintenance cost is real. Hundreds of evals require infrastructure, dataset curation, regression triage. A small set of well-chosen evals stays load-bearing.
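The interpretability requirement can be sketched in code. This is a minimal Python sketch under stated assumptions, not Anthropic's tooling: `EvalCase`, `EvalResult`, `run_evals`, the `names_changed_file` check, and the stubbed `fake_generate` are hypothetical names invented for illustration. What it shows is that each check returns a reason alongside pass/fail, so a failure reads as a diagnosis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    name: str
    passed: bool
    reason: str  # human-readable explanation of what passed or broke


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], tuple[bool, str]]  # returns (passed, reason)


def run_evals(cases, generate):
    """Run every case against `generate`, a stand-in for the model + harness."""
    results = []
    for case in cases:
        output = generate(case.prompt)
        passed, reason = case.check(output)
        results.append(EvalResult(case.name, passed, reason))
    return results


def names_changed_file(output: str) -> tuple[bool, str]:
    # Hypothetical judgment call: a bug-fix summary should name the file it changed.
    if ".py" in output:
        return True, "summary names a changed file"
    return False, "summary never names a file, so a reviewer cannot locate the change"


def fake_generate(prompt: str) -> str:
    # Placeholder for a real model call (assumption: no real API is used here).
    return "Fixed the off-by-one error in the pagination logic."


cases = [EvalCase("bugfix-summary-names-file", "Summarize your fix.", names_changed_file)]
results = run_evals(cases, fake_generate)
for r in results:
    print(f"{'PASS' if r.passed else 'FAIL'} {r.name}: {r.reason}")
```

With ten cases of this shape, a red result already says which judgment call was violated, which is what keeps a small set load-bearing.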
Compare Harness Shrinkage as Models Improve — Cat's claim that prompt scaffolding shrinks each release. Evals don't shrink the same way: they encode what we want, which the model still has to be measured against even as the model gets stronger.
Where evals fit in Cat's debugging stack#
The full Cat Wu PM debugging stack is three-part:
1. Ask the model to introspect (Model Introspection Feedback): when the model does something unexpected, ask why. The model's answer is signal about harness gaps, not about the model.
2. Get fast feedback from a small group of taste-makers: five people whose feedback is qualified, who can articulate what makes a model/harness combination good. Cat's vibe-check during team lunches is the canonical example.
3. Build evals: the third tool, the slow/durable one. When (1) and (2) surface a hypothesis ("the model isn't testing itself enough"), evals are what verify the hypothesis at scale and prevent regression after the fix.
The three tools complement each other:
- (1) gives hypothesis (model's self-report)
- (2) gives direction (taste-maker judgment)
- (3) gives proof + regression guardrail (eval)
Memory as the canonical eval-needing feature#
Cat names memory as the feature where evals matter most:
"Features such as memory benefit a lot from this."
Why memory specifically? Memory is the canonical case where:
- The success criterion is "did the system remember the right thing at the right time?", which stays subjective without a ground-truth dataset.
- Failure modes are easy to misdiagnose ("the model loves writing memories but we're not sure if they're high quality").
- The fix loop without evals is slow: you'd need a real user trial to know whether a memory change improved or regressed things.
Without evals, memory feature work descends into vibes and anecdotes. With evals, the team can quantify "is this version of memory better than the previous one for the workflows we care about?"
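A version-to-version comparison of that kind might look like the following sketch. Everything here is assumed for illustration: the labeled cases, the two toy retrieval functions `retrieve_v1` and `retrieve_v2`, and the exact-match scoring. A real memory system would need richer cases and scoring.

```python
# Each case: what was stored earlier, what the user asks now, and which stored
# memory a human labeled as the right one to surface (hypothetical data).
CASES = [
    {"stored": ["deploys on fridays", "prefers pytest for tests"],
     "query": "write tests for this module",
     "expected": "prefers pytest for tests"},
    {"stored": ["project targets python 3.9", "uses tabs not spaces"],
     "query": "which python features can i use",
     "expected": "project targets python 3.9"},
    {"stored": ["staging database is read-only", "likes short commit messages"],
     "query": "commit this change",
     "expected": "likes short commit messages"},
]


def retrieve_v1(query: str, stored: list[str]) -> str:
    # Baseline: always surface the first stored memory.
    return stored[0]


def retrieve_v2(query: str, stored: list[str]) -> str:
    # Candidate: surface the memory sharing the most words with the query.
    query_words = set(query.lower().split())
    return max(stored, key=lambda m: len(query_words & set(m.lower().split())))


def hit_rate(retrieve) -> float:
    """Fraction of cases where the surfaced memory matches the human label."""
    hits = sum(retrieve(c["query"], c["stored"]) == c["expected"] for c in CASES)
    return hits / len(CASES)


for name, fn in [("v1", retrieve_v1), ("v2", retrieve_v2)]:
    print(f"{name}: {hit_rate(fn):.0%} of cases surfaced the labeled memory")
```

The labeled `expected` field is the ground-truth dataset the section says memory lacks by default; once it exists, "is this version better?" becomes a number that can be recomputed in seconds.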
What makes someone "good at evals"#
Cat names two reference cases:
- Amanda — the person at Anthropic who molds Claude's character. "It's just like such a hard role because the task is so ambiguous. Even coding is easier because you can verify the success whereas crafting the character requires a very strong sense of conviction in what who Claude should be." The skill is articulating an ambiguous goal precisely enough that you can measure progress against it.
- The Claude Code team's lunchtime vibe-checks: feedback like "this model isn't testing itself enough" gets translated into "okay, what data do we look at to verify whether this is a pattern?" which becomes "okay, what eval would prove or disprove this hypothesis?"
The pattern is the same: strong opinion about what good looks like + ability to translate that opinion into a measurable artifact. This is taste rendered as a function call.
Connection to Matt Pocock's verification (Design Concept Grilling)#
Matt Pocock doesn't use the word "evals"; his pedagogical framing is "verification" and "feedback loops." But the underlying argument is the same: in an agent-coding workflow, the quality of feedback loops bounds the quality of output. Pocock's deep-module pattern treats integration tests as a load-bearing harness asset because the model needs verification it can run itself in the loop.
The convergence: PM-side evals (Cat) and engineer-side integration tests (Matt) are the same primitive — a runnable artifact that encodes a judgment call — applied at different layers of the product.
Connection to the Founder's Playbook (AI-Native Startup Lifecycle)#
The playbook's adjacent concept is "build your measurement framework before launch" in the MVP stage:
"The founders who mis-identify early traction as product-market fit are typically the same ones who started tracking data after launch, using metrics chosen to assess what was working rather than to surface what wasn't. The antidote is to establish your measurement framework before the first user shows up."
This is the same skill one layer up: not "what does success look like for this feature?" but "what does success look like for this product, in this market, with these users?" The playbook makes Claude itself the eval-design partner ("design your measurement framework before launching" via Claude consultation).
For founders applying both views: write product-level metrics (CAC, retention, Sean Ellis score) AND feature-level evals (does this feature do what we wanted? does the latest model improve or regress it?). The first is for go/no-go on the company; the second is for go/no-go on each shipped change.
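The two gates can be sketched side by side. This is an illustrative sketch only: the survey responses and eval names are made up, `sean_ellis_score` and `regressions` are hypothetical helpers, and the 40% threshold is the commonly cited benchmark for the Sean Ellis test rather than a figure from the playbook.

```python
def sean_ellis_score(responses: list[str]) -> float:
    """Share of respondents who would be 'very disappointed' without the product."""
    return responses.count("very disappointed") / len(responses)


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Evals that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(n for n, ok in baseline.items() if ok and not candidate.get(n, False))


# Product-level gate: go/no-go on the company (hypothetical survey data).
survey = (["very disappointed"] * 9
          + ["somewhat disappointed"] * 8
          + ["not disappointed"] * 3)
company_go = sean_ellis_score(survey) >= 0.40  # assumed benchmark threshold

# Feature-level gate: go/no-go on one shipped change (hypothetical eval results).
baseline = {"memory-recall": True, "summary-names-file": True, "self-test": False}
candidate = {"memory-recall": True, "summary-names-file": False, "self-test": True}
broken = regressions(baseline, candidate)
change_go = not broken

print(f"company go: {company_go}; change go: {change_go}; regressions: {broken}")
```

Here the product metric clears its bar while the change is blocked by one regressed eval, which is the point of keeping the two levels separate.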
Why this is "underappreciated" in 2026#
Cat's claim that the skill is underappreciated has three readings:
- Cultural. PMs trained pre-2023 don't write code, much less evals. Eval-writing requires comfort with datasets, scoring functions, and probabilistic outputs — a skill set the prior PM pipeline didn't select for.
- Status. "Writing tests" has historically been low-status engineering work. Evals are tests, dressed up. The PM who writes evals is doing work that looks like QA but is in fact product spec.
- Tractability. Most PMs don't realize how much eval-writing they could be doing because the tooling is uneven and the discipline isn't taught. Cat's "ten great evals" is partly a permission slip: you don't need a hundred, you need ten.
This predicts a near-term role redefinition: PMs who can write evals will out-ship PMs who can't. It fits the Engineer PM Convergence framing, in which engineers and PMs converge on a hybrid role and evals are one of the activities both end up doing.
Open questions#
- How do you write an eval for taste-driven features like character? Amanda's role is the canonical eval-resistant case; Cat names her as someone who is good at evals, but doesn't describe her technique.
- The 10-vs-100 number is given without justification. Is there a Goldilocks zone, or does it depend on feature surface area? Client-Side Agent Optimization's framing of combos suggests evals also have a combinatorial explosion problem.
- How do evals interact with Harness Shrinkage as Models Improve? When a harness asset shrinks because the model now handles it natively, the evals built around the old harness may become artifacts rather than guardrails. Does Anthropic retire evals or repurpose them?
- Is there a single non-Anthropic example of a PM-as-eval-writer to cite, or is this currently a Cat-Wu-singular framing? The Matt Pocock workshop reaches the same place from a different vocabulary, but no third source has been ingested yet.
Connections#
- Cat Wu — primary articulator; lead voice across this concept
- Claude Code / Cowork / Anthropic — context where the concept developed
- Claude Character as Product — Amanda's role; eval-resistant taste codified anyway
- Model Introspection Feedback — companion debugging technique (hypothesis, not proof)
- Harness Shrinkage as Models Improve — what doesn't shrink; eval-as-durable-artifact
- Engineer PM Convergence — eval-writing as the hybrid skill the converged role requires
- AI Native Product Cadence — the rapid cadence is sustainable only because evals provide regression guardrails
- AI-Native Startup Lifecycle — "build measurement framework before launch" is the product-level mirror
- Design Concept Grilling / Deep Modules for Agents — Matt Pocock's verification-loop framing; same primitive from engineering side
- Claude Code Best Practices — verification-driven development; evals as the strict version
- Claude Character as Product — character work as the limit case of eval-resistant features that nonetheless need evals
- Model Spec Science — the alignment-research analog: empirically measure which spec features generalize, treat the spec itself as eval-testable
- Verification as the New Bottleneck — Fiona Fung's org-level claim that verification (which evals encode) is now the scarce resource once coding is cheap
- Dogfooding as Product Discipline — evals encode taste; dogfooding ("ant food," lunchtime vibe-checks) is how the taste evals encode is acquired
- The Verifiability Thesis — Karpathy's "automate what you can verify"; evals are verification authored as product spec
Sources#
- How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code): primary articulation (timestamp ~55:00, "Why building evals is underappreciated"); evals are also mentioned throughout the debugging-stack section
- Full Walkthrough: Workflow for AI Coding — Matt Pocock — verification-loop framing; convergent argument from engineering pedagogy
- The Founder's Playbook: Building an AI-Native Startup — "build measurement framework before launch" as product-level analog