Production-Sourced Evaluation

Published: June 15, 2026 | Filed: Concept | Domain: LLM Architecture | Tags: Evaluation, Benchmarks, Data Provenance, Privacy | Reading: 7 min | Source: AI-synthesised

Building benchmarks from de-identified real production usage rather than synthetic or hand-authored tasks; DRACO's central method — difficulty-proxied sampling, PII-stripping, augmentation, automatable refresh with a human QA gate; representativeness vs. over-specification tradeoff; production traffic as a proprietary eval asset

Sources#

Summary#

Production-sourced evaluation builds a benchmark from real, de-identified usage of a deployed system rather than from synthetic generation or hand-authored prompts. The argument: a benchmark's whole value is predicting real-world performance, and the most representative tasks are the ones users actually issued. DRACO (Perplexity, 2026) is the worked example — its 100 deep-research tasks are distilled from tens of millions of real Perplexity Deep Research queries — and the method is the paper's central contribution, distinct from the rubric design or the grading protocol.

The method (DRACO)#

  1. Difficulty-proxied sampling. Start from production traffic, biased toward hard cases: DRACO sampled 1,000 queries that drew subsequent negative sentiment or an explicit thumbs-down — i.e., the queries the deployed system handled worst. This mines failures the system already exhibits, which synthetic generation can't target.
  2. Privacy-preserving reformulation. An automated LLM pipeline strips PII and reduces ambiguity. Critically, no raw user query is ever exposed to a human analyst — anonymization is a precondition, enforced architecturally, not a cleanup step.
  3. Augmentation toward difficulty + specification. Real queries are often under-specified; augmentation adds context (persona, output format, sources) and broadens scope (temporal, comparative, geographic) so tasks become well-defined and challenging while still reflecting implicit user intent.
  4. Filtering for objective / tractable / difficult. Keep only tasks with convergent expert success criteria, bounded scope, and genuine difficulty.
  5. Human gate. A final in-house expert review for security and quality. The pipeline is automatable end-to-end but deliberately keeps a human as the last safety/quality gate.
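
The five steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not DRACO's implementation: every name here (`Query`, `Task`, `sample_hard`, `strip_pii`, `augment`, `keep`, `build_benchmark`) is hypothetical, the regex redaction is a crude stand-in for the automated LLM reformulation the source describes, and the boolean flags on `Task` stand in for the expert judgments behind the filter.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    thumbs_down: bool = False
    negative_sentiment: bool = False


@dataclass
class Task:
    prompt: str
    objective: bool = True   # assumed flag: convergent success criteria
    bounded: bool = True     # assumed flag: bounded scope
    difficult: bool = True   # assumed flag: genuinely hard
    approved: bool = False


# Crude illustrative patterns; a real pipeline would need far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def sample_hard(queries, k, seed=0):
    """Step 1: keep queries that carry a failure signal, then sample up to k."""
    hard = [q for q in queries if q.thumbs_down or q.negative_sentiment]
    return random.Random(seed).sample(hard, min(k, len(hard)))


def strip_pii(text):
    """Step 2 stand-in: redact before anything reaches a human reviewer."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def augment(text, persona, output_format):
    """Step 3: attach context so the task is well specified."""
    return f"{text}\nPersona: {persona}\nOutput format: {output_format}"


def keep(task):
    """Step 4: objective, tractable, and difficult tasks only."""
    return task.objective and task.bounded and task.difficult


def build_benchmark(queries, k, reviewer):
    """Run steps 1-5; `reviewer` is the human gate and always runs last."""
    tasks = []
    for q in sample_hard(queries, k):
        clean = strip_pii(q.text)  # the raw text never leaves this function
        task = Task(prompt=augment(clean, "policy analyst", "cited report"))
        if keep(task) and reviewer(task):
            task.approved = True
            tasks.append(task)
    return tasks
```

The ordering is the point of the sketch: redaction happens before the reviewer callback is ever invoked, so the human gate only sees de-identified prompts, and nothing enters the benchmark without passing that gate.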

The payoff DRACO claims: a benchmark that is representative (mirrors the real domain mix and real failure modes) and refreshable (because both research needs and usage evolve, the pipeline can regenerate fresh tasks rather than ossifying).

The core tradeoff: representativeness vs. over-specification#

Production-sourcing buys representativeness, but the augmentation step that makes raw queries evaluable also threatens it. The paper is candid: systematic augmentation "reduces ambiguity and improves reproducibility, but it also risks over-specifying tasks and dampening the natural variability of user queries." The de-identification + augmentation pipeline turns a messy, personal, ambiguous query into a clean, bounded, comparable task, and some of what's stripped (ambiguity, personal context, the actual phrasing) is also part of what makes real usage real. Production-sourced evaluation is more representative than synthetic evaluation, but it is not raw production.
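
One way to put a number on this tradeoff is to compare the domain mix of the sampled raw queries with the domain mix of the final tasks. The sketch below uses total variation distance for that comparison. It is an illustration only: the source does not say how (or whether) DRACO measures this, the helper names `domain_mix` and `total_variation` are hypothetical, and it assumes each query and task already carries a domain label.

```python
from collections import Counter


def domain_mix(labels):
    """Turn a list of domain labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical mixes, 1 means disjoint."""
    domains = set(p) | set(q)
    return 0.5 * sum(abs(p.get(d, 0.0) - q.get(d, 0.0)) for d in domains)


# Hypothetical labels for ten raw queries and the ten tasks derived from them.
raw = ["law"] * 5 + ["medicine"] * 3 + ["finance"] * 2
augmented = ["law"] * 4 + ["medicine"] * 4 + ["finance"] * 2
drift = total_variation(domain_mix(raw), domain_mix(augmented))
```

A domain-level distance like this only catches shifts in topic mix. It says nothing about the loss of ambiguity or natural phrasing that the paper flags, which would need a separate measure.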

Why production traffic is a moat-grade eval asset#

The method only works if you have a large-scale deployed system generating the traffic — which is exactly the proprietary-data position described in Compounding Data Moat. Real usage at scale is "time-locked, context-specific, and impossible for a copycat to recreate"; here that same asset doubles as an evaluation substrate. A vendor with production traffic can build representative, difficulty-targeted, continuously-refreshed benchmarks that a competitor without deployment simply cannot — and can do so on tasks where its own product currently fails (the thumbs-down sampling). This is the data flywheel pointed at measurement: usage → failure signal → benchmark → product improvement.

The flip side is a credibility question (see DRACO Benchmark): a benchmark sourced from one vendor's traffic, on which that vendor's product wins, carries an obvious conflict of interest. The human gate and the expert rubrics are partly there to answer it.

The product-loop form: Google's flywheel#

Google's Agent Quality Flywheel operationalizes the same principle as a continuous product loop rather than a benchmark. Agents emit OTel traces; each production session "is a genuine request… and each failure is a ready-made test case for the next cycle." Complete traces skip inference and are graded in place; Online Monitors score live traffic continuously, and drifting scores hand failing traces to the eval-fix loop. Google states the ordering explicitly: synthetic scenarios (its User Simulator) are a cold-start bootstrap — "synthetic scenarios get you moving; production data is what makes the loop sharp." That makes three independent arrivals at production-as-eval-substrate: DRACO (capability benchmark), Deployment Simulation (safety forecasting), and the flywheel (continuous quality monitoring) — the method crossing from benchmark construction into day-to-day product tooling.
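
The monitoring half of that loop can be sketched as a rolling-score monitor that hands failing traces to an eval queue when the average drifts. This is a hedged illustration, not Google's API: `OnlineMonitor`, its parameters, and the plain-dict traces (standing in for OTel traces) are all assumptions, and the grading function is supplied by the caller.

```python
from collections import deque
from statistics import mean


class OnlineMonitor:
    """Scores live traces and queues failures when the rolling mean drifts."""

    def __init__(self, grade, window=50, baseline=0.9, tolerance=0.1, pass_mark=0.5):
        self.grade = grade                   # caller-supplied: trace -> score in [0, 1]
        self.recent = deque(maxlen=window)   # (trace, score) pairs, newest last
        self.baseline = baseline             # expected healthy rolling score
        self.tolerance = tolerance           # allowed drop before acting
        self.pass_mark = pass_mark           # below this, a single trace counts as failing
        self.eval_queue = []                 # failing traces handed to the eval-fix loop

    def observe(self, trace):
        """Grade a completed trace in place; return the rolling score seen."""
        self.recent.append((trace, self.grade(trace)))
        rolling = mean(score for _, score in self.recent)
        if rolling < self.baseline - self.tolerance:
            # Drift detected: each failing trace becomes a ready-made test case.
            self.eval_queue.extend(
                t for t, score in self.recent if score < self.pass_mark
            )
            self.recent.clear()  # start a fresh window so traces are not re-queued
        return rolling
```

Clearing the window after a hand-off is a design choice made for this sketch so that the same trace is not queued twice; a real system might instead deduplicate by trace ID.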

Contrast with the alternatives#

  • Synthetic generation (DeepResearchEval, ReportBench, DeepScholar-Bench, DRBench) — scalable, no privacy exposure, but tasks are model-imagined and may miss real failure modes.
  • Hand-authored from interviews/searches (xBench, ResearcherBench, DEER) — human-authored and realistic, but bounded by author imagination and not drawn from a live production system.
  • Production-sourced (DRACO) — the only one of the three that mines the actual distribution and the actual failures, at the cost of needing deployment access and a privacy pipeline.

Connections#

  • DRACO Benchmark — the worked example; this method is its central contribution
  • LLM-as-a-Judge — the grading half of the pipeline; production-sourced tasks + rubric-judge grading make an automatable (human-gated) eval
  • Deep Research Agents — the system class whose production traffic DRACO mines
  • Compounding Data Moat — production usage as a time-locked proprietary asset; this is that asset repurposed as an evaluation substrate
  • Evals as Product Spec — "build your measurement framework before launch / from real usage"; production-sourced evaluation is that principle at benchmark scale
  • Task Time-Horizon Scaling — sibling concern: benchmarks saturate, so the ability to refresh from live usage is what keeps an eval alive
  • Automated Behavioral Audit — the alignment-side analog notes its synthetic scenarios "may not match real-traffic distributions" — exactly the gap production-sourcing closes
  • Telemetry vs. Survey Measurement: Faros AI's telemetry-over-survey stance is the engineering-metrics sibling, measuring from the real system rather than from self-report
  • Deployment Simulation — the alignment-side application of the same method: OpenAI replays de-identified production conversations to forecast safety behavior pre-release, where DRACO replays them to build a capability benchmark; same PII pipeline, same proprietary-traffic moat
  • Conversation-to-Delegation Shift — its measurement-obsolescence argument is the same instinct one step further: as usage becomes delegation, even which metrics to read (complexity, runtime, concurrency, output) must be re-sourced from real agentic behavior, not interaction counts
  • Agent Quality Flywheel — the continuous product-loop form: OTel production traces graded in place, Online Monitors on live traffic, synthetic simulation demoted to cold-start bootstrap
  • Failures That Look Like Success — the failure class production-scale traces could quantify: silent contract violations that demo-sized evals only sample

Open questions#

  • How much does augmentation distort the distribution it claims to represent? Is there a measurable representativeness loss between raw queries and augmented tasks?
  • Difficulty-by-thumbs-down biases toward current failures — does that make the benchmark a moving target that flatters the next model trained on those failures?
  • Can the privacy pipeline (no human sees raw queries) be trusted/audited well enough for regulated domains (medicine, law) where the source traffic is most sensitive?

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
