
LLM-as-a-Judge

Published: June 15, 2026 · Filed: Concept · Domain: LLM Architecture · Tags: Evaluation, LLM As A Judge, Benchmarks, Rubrics · Reading: 7 min · Source: AI-synthesised

Using one LLM to grade another's outputs against criteria or rubrics. DRACO's protocol is a per-criterion binary MET/UNMET verdict plus a justification, weight-aggregated into a normalized score and a pass rate. Two key properties: rankings stay stable across judge models while absolute magnitudes vary, and adaptive per-case rubrics (Google's AutoRaters) detect failures but blend them away, which motivates stable custom metrics for the behavior under change.



Summary#

LLM-as-a-judge is the evaluation paradigm where one language model scores another model's outputs against explicit criteria — replacing (or scaling beyond) human graders for open-ended tasks that have no deterministic ground truth. It is the workhorse for grading anything where "is this output good?" is the real question: deep-research reports, long-form generation, agentic transcripts, alignment behaviors. DRACO (Perplexity, 2026) is the worked example this page is built on, but the primitive recurs across the wiki — from Anthropic's alignment audits to DeepMind's proof-search fitness.

The DRACO grading protocol (canonical shape)#

For each task, the judge evaluates the output against a task-specific rubric of weighted criteria. Per criterion:

  • The judge outputs a binary verdict — MET or UNMET — plus a short justification.
  • Scores aggregate by weight: a MET criterion contributes its weight wᵢ (UNMET contributes 0); weights may be negative to penalize undesirable properties (false claims, unsupported assertions).

Two reported numbers:

  • Normalized score = max(0, min(1, raw_score / Σ max(0, wᵢ))) × 100% — weighted by criterion importance.
  • Pass rate = fraction of criteria that received the desired verdict: MET for positive-weighted criteria, UNMET for negative-weighted ones. It is unweighted, and therefore more robust to subjectivity in the weights.

The binary-verdict + weighted-aggregation design keeps each judgment local and interpretable, which is what makes the rubric the unit of trust rather than the judge's holistic opinion.
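
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DRACO's actual implementation: the `Verdict` type, the function names, and the example rubric are hypothetical, and treating a zero-weight criterion as "desired UNMET" is an edge-case choice the protocol description does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    weight: float  # criterion weight w_i; negative weights penalize undesirable properties
    met: bool      # the judge's binary verdict: True = MET, False = UNMET


def normalized_score(verdicts: list[Verdict]) -> float:
    """Return max(0, min(1, raw_score / sum of positive weights)) as a percentage."""
    raw_score = sum(v.weight for v in verdicts if v.met)  # UNMET contributes 0
    positive_total = sum(max(0.0, v.weight) for v in verdicts)
    if positive_total == 0:
        raise ValueError("rubric needs at least one positive-weighted criterion")
    return max(0.0, min(1.0, raw_score / positive_total)) * 100.0


def pass_rate(verdicts: list[Verdict]) -> float:
    """Return the unweighted fraction of criteria that got the desired verdict."""
    if not verdicts:
        raise ValueError("rubric has no criteria")
    # Desired verdict: MET when the weight is positive, UNMET otherwise
    # (zero weights fall in the "otherwise" branch; that is an assumption).
    passed = sum(1 for v in verdicts if v.met == (v.weight > 0))
    return passed / len(verdicts)


# Hypothetical rubric: three positive criteria and one penalty criterion.
example = [
    Verdict(weight=5, met=True),
    Verdict(weight=2, met=True),
    Verdict(weight=1, met=False),  # a desired property is missing
    Verdict(weight=-2, met=True),  # an undesirable property is present
]

print(normalized_score(example))  # 62.5 (raw score 5 + 2 - 2 = 5, out of 8)
print(pass_rate(example))         # 0.5  (2 of 4 criteria have the desired verdict)
```

The example shows why the two numbers can diverge: the heavily weighted criterion dominates the normalized score, while the pass rate counts every criterion equally.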

The adaptive-rubric variant: Google's AutoRaters#

Google's Gemini Enterprise Agent Platform AutoRaters (developed with Google DeepMind; the grading engine of the Agent Quality Flywheel) extend the primitive from fixed task rubrics to per-case adaptive rubrics for multi-turn agents: the judge extracts the user's intent from the conversation, generates rubric criteria specific to that case, validates the whole trace against each criterion, and majority-votes across samples. Two lessons carry beyond Google's stack:

  • Deltas over absolutes. Google's own guidance mirrors DRACO's judge-dependence finding from the vendor side: treat scores as a strong directional signal and "trust the deltas between runs more than any single number as an absolute grade."
  • Adaptive rubrics detect but don't isolate. Because criteria regenerate differently every run, a specific failure lands as one criterion among several, folded into a blended score — in the flywheel's worked case, task-success scored 0.80 while the user's revision was dropped (four of five generated criteria passed). There is no stable number to threshold or trend. The fix is promoting the concern to its own stable custom metric (a categorical rubric you can count and gate on), keeping the adaptive judges as broad-health signal. See Failures That Look Like Success for the failure class this hides.
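
The blending effect described above can be made concrete with a small sketch. Everything here is an assumption for illustration: the criterion names, the category labels, the sample counts, and the 5% gate are hypothetical, and the code does not reproduce any Google API. It only mirrors the arithmetic of the worked case, where four of five generated criteria pass.

```python
from collections import Counter


def majority_vote(samples: list[bool]) -> bool:
    """Return the most common verdict across repeated judge samples."""
    return Counter(samples).most_common(1)[0][0]


def blended_score(criteria_samples: dict[str, list[bool]]) -> float:
    """Average the majority-voted verdicts of one run's generated criteria."""
    verdicts = [majority_vote(s) for s in criteria_samples.values()]
    return sum(verdicts) / len(verdicts)


def revision_drop_rate(labels: list[str]) -> float:
    """Stable custom metric: share of revision requests that were dropped."""
    relevant = [label for label in labels if label != "no_revision_requested"]
    if not relevant:
        return 0.0
    return relevant.count("revision_dropped") / len(relevant)


# One run of an adaptive rubric (hypothetical criteria, three samples each).
run = {
    "understood_initial_request": [True, True, True],
    "used_correct_tool": [True, True, False],
    "response_is_polite": [True, True, True],
    "answer_is_well_formatted": [True, False, True],
    "applied_user_revision": [False, False, True],  # the real failure
}
print(blended_score(run))  # 0.8: the dropped revision is one criterion among five

# The same concern as a categorical rubric, counted across several traces.
labels = [
    "revision_applied",
    "revision_dropped",
    "no_revision_requested",
    "revision_applied",
    "revision_applied",
]
drop_rate = revision_drop_rate(labels)
print(drop_rate)          # 0.25
print(drop_rate <= 0.05)  # False: the gate fails even though the blended score looks healthy
```

Because the category labels are fixed across runs, the drop rate can be thresholded and trended, which the regenerated criteria cannot.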

The judge-dependence property#

DRACO's most transferable methodological lesson: relative rankings are stable across judge models, but absolute score magnitudes are not. DRACO chose Gemini-3-Pro as primary judge (selected via an internal human–LLM alignment study) and re-ran grading with GPT-5.2 and Sonnet-4.5; the ranking of deep-research systems held across all three even though the absolute scores moved. Practical consequences:

  • Use LLM-as-a-judge for ordinal comparisons (which system/version is better), and distrust cross-paper absolute-score comparisons that used different judges.
  • Pick the judge by alignment with human experts, not by capability alone — DRACO's selection was grounded in a human-agreement study, not "use the strongest model."
  • A judge can carry its own biases into grading, a known confound when the judge and a graded model share lineage (see Automated Behavioral Audit, where a constitution-adherence variant graded by Opus 4.7 may inherit that model's biases).
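
One way to check the ordinal-versus-absolute distinction on your own data is to compare rank agreement and score shift separately. The sketch below is an assumption-laden illustration: the judge and system names are placeholders, the scores are made-up numbers (not DRACO results), and Kendall's tau is simply one reasonable rank-agreement statistic, not one the source says DRACO used.

```python
from itertools import combinations


def ranking(scores: dict[str, float]) -> list[str]:
    """Order systems from best to worst under one judge."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(a: dict[str, float], b: dict[str, float]) -> float:
    """Kendall tau-a over the same systems; 1.0 means identical ordering."""
    concordant = discordant = 0
    for x, y in combinations(a, 2):
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def max_abs_shift(a: dict[str, float], b: dict[str, float]) -> float:
    """Largest per-system difference in absolute score between two judges."""
    return max(abs(a[name] - b[name]) for name in a)


# Made-up normalized scores for three systems under two judges.
judge_x = {"system_a": 70.0, "system_b": 55.0, "system_c": 40.0}
judge_y = {"system_a": 52.0, "system_b": 41.0, "system_c": 30.0}

print(ranking(judge_x) == ranking(judge_y))  # True: same ordering
print(kendall_tau(judge_x, judge_y))         # 1.0
print(max_abs_shift(judge_x, judge_y))       # 18.0: magnitudes still differ a lot
```

A tau near 1.0 alongside a large shift is the pattern the section describes: safe for "which is better", unsafe for comparing raw numbers across judges.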

Where it recurs in the wiki#

LLM-as-a-judge is the same primitive seen across very different domains, always doing the job of converting an open-ended quality question into a scored signal:

  • Alignment evaluation (Automated Behavioral Audit): an investigator model probes a target, and a separate judge model scores behavior across dozens of dimensions. Same architecture, applied to safety rather than research quality.
  • Formal proof search (Evolutionary Proof Search): cheaper LLM-critic rater agents assign relative fitness to incomplete proof sketches (a Plackett–Luce ranking), turning a binary compiler signal into a continuous gradient. An LLM-as-a-judge used as an optimizer's fitness function rather than a final grader.
  • Product evals (Evals as Product Spec): Cat Wu's "ten great evals" are runnable judgment-encoders; rubric-graded LLM scoring is how you scale "what does done look like?" to ambiguous AI features.

Limits#

  • Cost/alignment tradeoff. Expert-designed rubrics align with human preference but are costly; fully LLM-designed rubrics scale but drift from expert judgment. DRACO uses a hybrid (experts author/review with LLM assistance).
  • Not a ground-truth oracle. Unlike a Lean compiler (AI-Driven Formal Proof Search) or a passing test suite, an LLM judge is a fallible heuristic — its verdicts are themselves unverified. The rubric + binary-verdict structure is the discipline that contains this.
  • Self-grading and lineage bias. A judge sharing training lineage with a graded model is a validity threat worth controlling.
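
Two of the controls discussed on this page, choosing the judge by agreement with human experts and avoiding shared lineage with the graded model, can be combined in a simple selection step. The sketch below is hypothetical throughout: the judge names, lineage labels, and verdicts are invented, and a plain agreement rate is an assumed stand-in for whatever statistic a real human-alignment study would use (Cohen's kappa would be a reasonable alternative).

```python
def agreement(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of criteria where the judge matches the human expert."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must cover the same criteria")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)


def pick_judge(
    candidates: dict[str, tuple[str, list[bool]]],
    human_verdicts: list[bool],
    graded_lineage: str,
) -> str:
    """Pick the judge with the best human agreement, excluding shared lineage.

    candidates maps judge name -> (lineage label, verdicts on a calibration set).
    """
    eligible = {
        name: agreement(verdicts, human_verdicts)
        for name, (lineage, verdicts) in candidates.items()
        if lineage != graded_lineage  # control for lineage bias
    }
    if not eligible:
        raise ValueError("no candidate judge is independent of the graded model")
    return max(eligible, key=eligible.get)


# Invented calibration set: expert verdicts on four criteria.
human = [True, False, True, True]
candidates = {
    "judge_a": ("family_1", [True, False, True, True]),    # best agreement, shared lineage
    "judge_b": ("family_2", [True, False, False, True]),   # agreement 0.75
    "judge_c": ("family_3", [False, False, False, True]),  # agreement 0.5
}

print(pick_judge(candidates, human, graded_lineage="family_1"))  # judge_b
```

Excluding the best-agreeing judge costs some alignment here; whether that trade is worth it depends on how strongly lineage bias affects the result, which the page leaves as an open question.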

Connections#

  • DRACO Benchmark — the worked example: rubric-based binary-verdict grading with normalized score + pass rate, Gemini-3-Pro as judge
  • Automated Behavioral Audit — Anthropic's investigator-model + judge-model alignment evaluation; the same primitive applied to safety behaviors
  • Evolutionary Proof Search — LLM-critic rater agents as a fitness function: an LLM-as-a-judge used to score incomplete proof sketches
  • Evals as Product Spec — evals as the product-definition surface; LLM-as-a-judge is how rubric-style evals scale to open-ended output
  • Production-Sourced Evaluation — judge protocol pairs with production-sourced tasks to make DRACO an end-to-end automatable (but human-gated) eval
  • Deep Research Agents — the system class DRACO grades this way
  • AI-Driven Formal Proof Search — the verification-total contrast: a sound verifier needs no fallible judge
  • Verification as the New Bottleneck — LLM-as-a-judge is one (imperfect) answer to the verification-at-scale problem
  • Deployment Simulation — its graders (scoring resampled completions, classifying eval-vs-production) are LLM-as-judge detectors reused from known-undesired-behavior categories — the same primitive applied to pre-release safety forecasting
  • Agent Quality Flywheel — the productized eval-fix loop built on adaptive AutoRater judges plus stable custom rubrics
  • Optimizer–Evaluator Decoupling — the self-grading/lineage caveat elevated to an architectural rule: whatever proposes a change never grades it
  • Failures That Look Like Success — why blended adaptive scores miss single-criterion failures; the case for trace-level grading and metric promotion

Open questions#

  • How far can the judge's absolute calibration be trusted for thresholded decisions (ship/no-ship, RSP gating) as opposed to rankings?
  • Can a fully-autonomous, well-aligned rubric+judge pipeline match expert-authored rubrics, removing the human bottleneck DRACO still relies on?
  • When does judge-lineage bias actually flip a result, versus merely shift magnitudes?

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
