How Do You Write Evals for Taste? Character as the Limit Case

The paradox to resolve

Cat Wu holds two claims that look contradictory. First, character is the hardest thing to evaluate — "coding is easier because you can verify the success whereas crafting the character requires a very strong sense of conviction in who Claude should be" (Claude Character as Product). Second, she names Amanda — who molds Claude's character — as someone who is good at evals, "alongside the team-lunch vibe-check" (Evals as Product Spec). So taste is eval-resistant, yet someone evals it well. The resolution is that a taste eval is not a single scoring function bolted onto a fuzzy goal; it is a pipeline that converts conviction into a measurable artifact, and the three source concepts each supply one stage of it.

The technique, in three stages

Stage 1: Conviction is the precondition, not the eval

The thing that makes taste eval-able is upstream of any dataset: "a very strong sense of conviction in who Claude should be" (Claude Character as Product). Cat's two-skill decomposition of Amanda's role is exactly this — (1) convicted articulation of who Claude should be, and (2) the ability to say why a given response is on- or off-character (Claude Character as Product). Skill (2) is the eval skill: "strong opinion about what good looks like + ability to translate that opinion into a measurable artifact … taste rendered as a function call" (Evals as Product Spec). Without conviction, character work "drifts toward bland averaging" — and so would any eval built on it, because you'd have no ground truth to label against.

This is why taste evals can't be outsourced to a generic rubric: the rubric is the opinion. The skill is rare precisely because the articulation, not the measurement, is the hard part.

Stage 2: Dogfooding sources the failure modes; introspection narrows the search

You cannot write the eval before you know what "off-character" looks like in the wild. Two companion techniques feed the eval:

Qualitative-first vibe-checks (Dogfooding as Product Discipline): the team-lunch ritual asks "what is your vibe on the model?" and surfaces concrete failure signals such as "this model is too abrupt," "loves writing memories but quality is uncertain," and "doesn't test itself enough" (Claude Character as Product). This is the dogfooding discipline applied to an output that no compiler can judge.
Hypothesis → data probe (Model Introspection Feedback): the vibe signal informs which logged data to look at, "not the other way around. The team has too much data to mine blind; tacit signal narrows where to look" (Claude Character as Product). Introspection gives the hypothesis, the taste-makers give the direction, and the eval gives the proof plus a regression guardrail (Evals as Product Spec).

The eval is the last tool in the stack, the slow durable one — it codifies a judgment that vibe-checks first discovered and a data probe first confirmed.
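The hypothesis-to-probe step can be sketched in a few lines. This is a minimal illustration, not the team's tooling: `LoggedResponse`, `probe`, and the `looks_abrupt` word-count heuristic are all hypothetical names and assumptions introduced here, and the log entries are made up. The point is only that the vibe signal ("too abrupt") chooses the predicate, and the predicate chooses which logs get read.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LoggedResponse:
    response_id: str
    text: str


def probe(
    logs: List[LoggedResponse],
    predicate: Callable[[LoggedResponse], bool],
) -> Tuple[float, List[str]]:
    """Return the share of logs the predicate flags, plus the flagged ids."""
    hits = [r.response_id for r in logs if predicate(r)]
    rate = len(hits) / len(logs) if logs else 0.0
    return rate, hits


def looks_abrupt(r: LoggedResponse, max_words: int = 6) -> bool:
    # Assumption: a crude proxy for the vibe signal "this model is too abrupt".
    return len(r.text.split()) <= max_words


# Made-up log entries, for illustration only.
logs = [
    LoggedResponse("r1", "Done."),
    LoggedResponse("r2", "I fixed the failing test and reran the suite; everything passes now."),
    LoggedResponse("r3", "No."),
    LoggedResponse("r4", "Here is the summary you asked for, with the two open questions at the end."),
]

rate, hits = probe(logs, looks_abrupt)
print(f"flagged {rate:.0%} of responses: {hits}")
```

The flagged ids are what a reviewer would read next; the heuristic only narrows the search and is not itself the eval.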

Stage 3: Render the judgment as a runnable artifact and measure it across variants

Model Spec Science is the existence proof that ambiguous, values-shaped goals can be rendered empirically — and it supplies the measurement method:

Write down the spec of what you want (the Constitution is the textual side of character; "character is the felt-experience side, the constitution is the textual specification" — Claude Character as Product).
Produce competing variants and measure which generalizes better. The MSM paper does exactly this: Rules Spec vs Value-Augmented vs Rule-Augmented, trained on Qwen, scored on the Agentic Misalignment (AM) behavioral eval (Model Spec Science §5.1). The specs produced markedly different misuse rates: for example, value explanations cut policy misuse from 20% to 2% in one model family and from 6% to 0% in another.
Read the result back into authoring decisions: value explanations beat bare rules; specific guidance beats general "be ethical" framing (Model Spec Science §5.2). That is taste, optimized.

This is the same primitive Evals as Product Spec describes — "a runnable artifact that encodes a judgment call" — and the two concepts explicitly mirror each other: model-spec-science is the "product-side mirror … both insist that ambiguous-looking artifacts (specs, character, taste) can be rendered as runnable verification."
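To make the variant comparison concrete, the two before/after pairs quoted above (20% to 2%, and 6% to 0%) can be read as absolute and relative reductions. The sketch below is plain arithmetic on those quoted figures; the `reduction` helper and the `pair_a` / `pair_b` labels are hypothetical names chosen here, and nothing in it reproduces the MSM evaluation itself.

```python
from typing import Dict, Tuple


def reduction(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct if before_pct else 0.0
    return absolute, relative


# The two misuse-rate pairs quoted in the text; labels are placeholders.
reported: Dict[str, Tuple[float, float]] = {
    "pair_a": (20.0, 2.0),
    "pair_b": (6.0, 0.0),
}

summary = {name: reduction(before, after) for name, (before, after) in reported.items()}

for name, (absolute, relative) in summary.items():
    print(f"{name}: -{absolute:.0f} points ({relative:.0%} relative)")
```

Reporting both numbers matters when baselines differ: a 6-point drop from a 6% baseline eliminates the behavior entirely, while an 18-point drop from 20% leaves a residue.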

A concrete recipe (synthesized)

Combining the three:

State the conviction as a short, opinionated rubric ("low-ego, lighthearted-but-competent, honest-not-sycophantic" — Claude Character as Product). This is the spec.
Mine dogfooding signal for recurring off-character failure modes; turn each into labeled on-character/off-character example pairs.
Encode the judgment as a scoring function — a council of LLM judges (The Verifiability Thesis) or a human-rated rubric — that reproduces Amanda's "why this is off-character."
A/B across variants (model versions, spec versions) and measure drift — MSM's method applied to your feature.
Keep ~10, not 100. Each must be interpretable (a fail tells you what's broken), capture a judgment you'd otherwise litigate in review, and stay low-maintenance (Evals as Product Spec). Taste evals obey the same Goldilocks rule as any eval.
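The recipe above can be sketched end to end under heavy simplification. Everything named here is an assumption for illustration: the keyword-based "judges" stand in for LLM judge calls, the example responses and labels are invented, and majority voting is just one simple way to combine a council. The sketch keys judges to the rubric traits, scores labeled examples, and reports agreement per failure mode so that a low score points at what is broken.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A judge maps a response to True ("on-character") or False ("off-character").
Judge = Callable[[str], bool]


@dataclass
class LabeledExample:
    failure_mode: str   # recurring failure surfaced by dogfooding
    response: str
    on_character: bool  # human label from whoever holds the conviction


# Hypothetical stand-ins for LLM judges, one per rubric trait.
JUDGES: Dict[str, Judge] = {
    "low-ego": lambda text: "as an expert" not in text.lower(),
    "lighthearted-but-competent": lambda text: len(text.split()) > 3,
    "honest-not-sycophantic": lambda text: "great question" not in text.lower(),
}


def council_verdict(judges: Dict[str, Judge], response: str) -> bool:
    """Strict majority vote across the council."""
    votes = sum(1 for judge in judges.values() if judge(response))
    return votes * 2 > len(judges)


def agreement_by_mode(
    judges: Dict[str, Judge], examples: List[LabeledExample]
) -> Dict[str, float]:
    """Share of examples per failure mode where the council matches the label."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for ex in examples:
        totals[ex.failure_mode] = totals.get(ex.failure_mode, 0) + 1
        if council_verdict(judges, ex.response) == ex.on_character:
            correct[ex.failure_mode] = correct.get(ex.failure_mode, 0) + 1
    return {mode: correct.get(mode, 0) / n for mode, n in totals.items()}


# Invented on-character / off-character pairs, for illustration only.
EXAMPLES = [
    LabeledExample("sycophantic", "Great question! You are absolutely right, as an expert I agree.", False),
    LabeledExample("sycophantic", "I think the second approach has a bug; here is why.", True),
    LabeledExample("abrupt", "No.", False),
    LabeledExample("abrupt", "Not quite; the test fails because the fixture is stale.", True),
]

scores = agreement_by_mode(JUDGES, EXAMPLES)
print(scores)
```

In this toy run the council agrees with every "sycophantic" label but only half of the "abrupt" ones, because a single judge looks at length and is outvoted. That is the interpretable kind of failure the recipe asks for: the breakdown says which judgment the scoring function has not yet captured.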

Why this is the limit case, and where it still breaks

Character is the hardest end of the spectrum for two structural reasons:

No clean A/B isolation. Character interacts with capability, so "it's hard to say how much of 'Claude is great' is character vs reasoning" (Claude Character as Product). The eval can measure on-character-ness but struggles to attribute product outcomes to it.
The verifier's own boundary. The Verifiability Thesis's open question is whether a council of LLM judges holds "for genuinely contested value judgments, or only for quality/coherence." Character (warmth, wit) sits in the contested zone, not the coherence zone — so the judge is itself doing taste, recursively.
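One hedged way to see where a judge council sits relative to that boundary is to track how often the judges split. This is an assumption introduced here, not a method from the sources: the `split_rate` helper is hypothetical and the vote tables are made-up toy data. The idea is only that a higher split rate on a trait suggests the judges are exercising taste rather than checking coherence.

```python
from typing import List


def split_rate(votes_per_item: List[List[bool]]) -> float:
    """Fraction of items on which the judges did not vote unanimously."""
    if not votes_per_item:
        return 0.0
    split = sum(1 for votes in votes_per_item if len(set(votes)) > 1)
    return split / len(votes_per_item)


# Toy vote tables (one row per response, one column per judge); invented data.
coherence_votes = [
    [True, True, True],
    [False, False, False],
    [True, True, True],
    [True, True, False],
]
warmth_votes = [
    [True, False, True],
    [False, True, True],
    [True, True, True],
    [False, False, True],
]

print("coherence split rate:", split_rate(coherence_votes))
print("warmth split rate:", split_rate(warmth_votes))
```

A split rate does not resolve the recursion; it only flags the traits where a human with conviction still has to adjudicate.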

And the honest gap in the evidence: MSM demonstrates the method on the safety/values subset (oversight, honesty, no-ends-justify-means — measured via agentic-misalignment), not on the warm/lighthearted/witty personality subset. The live open question — "is the warm/curious personality also subject to spec-science optimization?" (Model Spec Science) — is exactly the part Cat says Amanda does well but "doesn't describe the technique" for (Evals as Product Spec). So we have a proven method for the verifiable core of "who Claude is," and a still-tacit craft for the aesthetic skin on top.

Why these evals are durable

Most harness assets shrink as models improve (Harness Shrinkage as Models Improve). Character is the documented exception — "capability changes between models; character should be stable" (Claude Character as Product) — and evals don't shrink either, because they "encode what we want, which the model still has to be measured against even as the model gets stronger" (Evals as Product Spec). A taste eval is therefore one of the few artifacts that compounds: it is the regression guardrail that lets you preserve identity across capability jumps rather than rebuild it each release.
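The regression-guardrail role can be sketched as a comparison of per-eval scores between a baseline model and a candidate. The eval names, scores, and the 0.05 tolerance below are all hypothetical placeholders; the sketch only shows the shape of the check, not a recommended threshold.

```python
from typing import Dict, List


def regression_check(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return the evals whose score dropped by more than `tolerance`.

    An eval missing from `candidate` is treated as a score of 0.0.
    """
    return sorted(
        name
        for name, base in baseline.items()
        if base - candidate.get(name, 0.0) > tolerance
    )


# Invented on-character scores for two model versions.
baseline = {"not_sycophantic": 0.92, "not_abrupt": 0.88, "low_ego": 0.95}
candidate = {"not_sycophantic": 0.93, "not_abrupt": 0.74, "low_ego": 0.93}

regressions = regression_check(baseline, candidate)
print("character regressions:", regressions)
```

Because the same small set of evals is rerun on each release, a capability jump that makes the model more abrupt shows up as a named regression instead of a vague vibe.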

Bottom line

You don't write an eval for taste; you run a pipeline. Conviction supplies the rubric, dogfooding and introspection supply the failure modes and narrow the data search, and the MSM variant-comparison method turns the rubric into a measurable artifact that guards against regression. The method is proven on the safety/values core of character and still tacit on the aesthetic surface, which is precisely why "good at evals for taste" remains a rare, named skill rather than a documented procedure.