Summary#
DRACO (Deep Research Accuracy, Completeness, and Objectivity) is a benchmark of 100 complex, open-ended deep-research tasks spanning 10 domains and requiring information from 40 countries, published by Perplexity (with a Harvard co-author) in February 2026 (arXiv:2602.11685). Its distinguishing feature: the tasks are drawn from real, de-identified production usage of Perplexity Deep Research (Production-Sourced Evaluation) rather than synthetic or hand-authored prompts, then paired with task-specific expert rubrics and graded by an LLM-as-a-judge. It is a benchmark of systems / products, not base models — which is what makes its headline finding (orchestration beats the bare model) legible.
Why it's different (Table 1)#
DRACO is positioned as the first deep-research benchmark to be simultaneously: production-sourced, human-authored, general-domain (not just specialized/technical), and expert-rubric-graded. Prior open-ended benchmarks each miss at least one — DeepResearchEval, ReportBench, DeepScholar-Bench, and DRBench rely on synthetic task generation; others are hand-authored but narrow or lack expert rubrics. None draws directly from a widely-available production deep-research system.
Task construction (5 stages)#
Sourced from production Perplexity Deep Research queries, then reformulated/augmented/filtered so tasks are anonymous, well-specified, bounded, challenging, and representative:
- Sampling — 1,000 high-difficulty English queries (Sep–Oct 2025), difficulty proxied by subsequent negative sentiment or a thumbs-down on the prior response.
- Pre-processing — LLM reformulation to strip PII and reduce ambiguity; fully automated, no raw query ever seen by a human analyst (privacy by design).
- Augmentation — systematic expansion along two axes: context (persona, output format, source specificity) and scope (temporal, cross-entity comparison, geography). Turns ambiguous queries into well-defined tasks reflecting implicit user intent.
- Filtering — LLM keeps only tasks that are objective (experts converge on what's good), tractable (bounded), and difficult (needs nontrivial multi-step gathering/synthesis).
- Curation — 100 tasks sampled to match the real domain distribution, then manually reviewed by in-house domain experts.
The 10 domains: Finance, Shopping/Product Comparison, Academic, Technology, General Knowledge, UX Design, Law, Medicine, Needle in a Haystack, Personalized Assistant.
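A minimal sketch of how the Curation stage's domain-matched sampling could work. The text above only says the 100 tasks were sampled to match the real domain distribution, so the largest-remainder allocation rule, the function names `allocate_quota` and `sample_tasks`, and the `(task_id, domain)` tuple shape are all illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter


def allocate_quota(domain_counts: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` slots across domains in proportion to `domain_counts`.

    Uses the largest-remainder rule (an assumption): floor every exact
    share, then give leftover slots to the largest fractional parts.
    """
    n = sum(domain_counts.values())
    exact = {d: total * c / n for d, c in domain_counts.items()}
    quota = {d: int(share) for d, share in exact.items()}
    leftover = total - sum(quota.values())
    by_remainder = sorted(exact, key=lambda d: exact[d] - quota[d], reverse=True)
    for d in by_remainder[:leftover]:
        quota[d] += 1
    return quota


def sample_tasks(pool: list[tuple[str, str]], total: int, seed: int = 0) -> list[tuple[str, str]]:
    """Draw a domain-proportional sample from `pool` of (task_id, domain) pairs."""
    rng = random.Random(seed)
    quota = allocate_quota(Counter(domain for _, domain in pool), total)
    picked = []
    for domain, k in quota.items():
        candidates = [task for task in pool if task[1] == domain]
        picked.extend(rng.sample(candidates, k))
    return picked


# Hypothetical filtered pool: 30 Finance, 15 Law, 5 Medicine tasks.
pool = (
    [(f"fin-{i}", "Finance") for i in range(30)]
    + [(f"law-{i}", "Law") for i in range(15)]
    + [(f"med-{i}", "Medicine") for i in range(5)]
)
sample = sample_tasks(pool, total=10)
print(Counter(domain for _, domain in sample))
```

The pool sizes here are placeholders. In the described pipeline the pool would be the tasks that survive Filtering, and the sample would still go to in-house domain experts for manual review.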
Rubric design and grading#
Rubrics were built with 26 recruited domain experts (doctors, attorneys, financial analysts, engineers, designers) over a 4-stage pipeline with LLM assistance, including a saturation test — if the leading system already scored >90% on a task, it was sent back for hardening (~45% of tasks were). Each task carries ~39.3 weighted criteria across four axes; about half target factual accuracy. Criteria are positive (desirable properties) or negative (pitfalls), with the harshest penalties reserved for harmful medical content (down to −500).
| Axis | Weight Range | ~Criteria/task |
|---|---|---|
| Factual Accuracy | −500 to +20 | 20.5 |
| Breadth & Depth of Analysis | −100 to +10 | 8.6 |
| Presentation Quality | −50 to +20 | 5.6 |
| Citation Quality | −150 to +10 | 4.8 |
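The saturation test described above is a simple threshold check. The sketch below assumes per-task scores for the leading system are available as fractions in [0, 1]. The name `needs_hardening` and the example scores are hypothetical; only the >90% threshold comes from the text.

```python
SATURATION_THRESHOLD = 0.90  # from the text: tasks scoring >90% go back for hardening


def needs_hardening(leading_scores: dict[str, float],
                    threshold: float = SATURATION_THRESHOLD) -> list[str]:
    """Return task IDs whose leading-system score is strictly above the threshold."""
    return sorted(task for task, score in leading_scores.items() if score > threshold)


# Hypothetical scores, for illustration only.
scores = {"task-a": 0.95, "task-b": 0.90, "task-c": 0.42}
flagged = needs_hardening(scores)
print(flagged)                      # ['task-a']
print(len(flagged) / len(scores))   # share of tasks sent back
```

Reading ">90%" as a strict inequality is an assumption; the source does not say how a score of exactly 90% is treated.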
Grading uses an open-source LLM-as-a-judge protocol: per-criterion binary MET/UNMET → weighted normalized score (0–100%) and pass rate. Judge = Gemini-3-Pro (chosen via an internal human–LLM alignment study); GPT-5.2 and Sonnet-4.5 corroborate. Rankings are stable across judges; absolute magnitudes vary.
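To make the aggregation step concrete, here is a hedged sketch of turning per-criterion binary verdicts into the two reported numbers. The exact formulas are not given in this summary, so both are assumptions. The normalized score is taken as the sum of weights of MET criteria divided by the sum of positive weights, clamped to 0–100. The pass rate is taken as the share of criteria with the desirable verdict (MET for positive criteria, UNMET for negative ones). `Criterion`, `normalized_score`, and `pass_rate` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    weight: float  # positive = desirable property, negative = pitfall
    met: bool      # judge verdict: True = MET, False = UNMET


def normalized_score(criteria: list[Criterion]) -> float:
    """Weighted score in [0, 100] under the assumed formula."""
    earned = sum(c.weight for c in criteria if c.met)
    max_possible = sum(c.weight for c in criteria if c.weight > 0)
    if max_possible <= 0:
        raise ValueError("rubric needs at least one positive-weight criterion")
    pct = 100.0 * earned / max_possible
    return min(100.0, max(0.0, pct))  # a triggered penalty cannot push below 0


def pass_rate(criteria: list[Criterion]) -> float:
    """Percent of criteria with the desirable verdict (assumed definition)."""
    passed = sum(1 for c in criteria if c.met == (c.weight > 0))
    return 100.0 * passed / len(criteria)


# Hypothetical rubric: three positive criteria and one pitfall that was avoided.
rubric = [
    Criterion(weight=10, met=True),
    Criterion(weight=10, met=False),
    Criterion(weight=5, met=True),
    Criterion(weight=-20, met=False),
]
print(normalized_score(rubric))  # 60.0
print(pass_rate(rubric))         # 75.0
```

Under these assumptions a single triggered −500 penalty drives the clamped score to 0 regardless of the positive criteria met, which matches the intent of reserving the harshest weights for harmful content.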
Headline results#
Perplexity Deep Research leads every domain and every rubric axis. Scores (in %) for the deep-research systems and the two bare-model (Claude with tools) baselines:
| System | Normalized | Pass rate |
|---|---|---|
| Perplexity Deep Research (Opus 4.6) | 70.5 | 72.8 |
| Perplexity Deep Research (Opus 4.5) | 67.2 | 70.9 |
| Gemini Deep Research | 59.0 | 62.7 |
| OpenAI Deep Research (o3) | 52.1 | 56.9 |
| OpenAI Deep Research (o4-mini) | 41.9 | 48.0 |
| Claude Opus 4.6 (bare + tools) | 59.8 | 63.1 |
| Claude Opus 4.5 (bare + tools) | 46.7 | 50.2 |
Three findings that matter for this wiki:
- Orchestration > base model. Perplexity Deep Research (Opus 4.6 base) beats bare Opus 4.6-with-tools by 10.7pp on normalized score and 9.7pp on pass rate; see Deep Research Agents. A live counter-datapoint to Harness Shrinkage as Models Improve.
- Claude Opus 4.6 is the strongest non-Perplexity system (59.8% / 63.1%), ahead of Gemini Deep Research and both OpenAI configs. It is the top-ranked non-Perplexity system in 5 of 10 domains.
- Factual accuracy / citation are the universal weak axes; presentation is strongest everywhere. The Perplexity-vs-second gap is largest in Finance (21.6pp) and smallest in Law (1.6pp).
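The orchestration gaps behind the first finding can be recomputed directly from the results table. The sketch below copies the table values verbatim. The names `RESULTS`, `gap`, and `ranking` are illustrative.

```python
# (normalized score, pass rate) copied from the results table above.
RESULTS = {
    "Perplexity Deep Research (Opus 4.6)": (70.5, 72.8),
    "Perplexity Deep Research (Opus 4.5)": (67.2, 70.9),
    "Gemini Deep Research": (59.0, 62.7),
    "OpenAI Deep Research (o3)": (52.1, 56.9),
    "OpenAI Deep Research (o4-mini)": (41.9, 48.0),
    "Claude Opus 4.6 (bare + tools)": (59.8, 63.1),
    "Claude Opus 4.5 (bare + tools)": (46.7, 50.2),
}


def gap(leader: str, other: str) -> tuple[float, ...]:
    """Percentage-point gap (normalized, pass rate), rounded to one decimal."""
    return tuple(round(a - b, 1) for a, b in zip(RESULTS[leader], RESULTS[other]))


def ranking() -> list[str]:
    """Systems ordered by normalized score, best first."""
    return sorted(RESULTS, key=lambda name: RESULTS[name][0], reverse=True)


# Same base model, with and without Perplexity's orchestration.
print(gap("Perplexity Deep Research (Opus 4.6)", "Claude Opus 4.6 (bare + tools)"))  # (10.7, 9.7)
print(gap("Perplexity Deep Research (Opus 4.5)", "Claude Opus 4.5 (bare + tools)"))  # (20.5, 20.7)
print(ranking()[:3])
```

Both gaps come straight from the table; the sketch makes no claim about why the Opus 4.5 and Opus 4.6 gaps differ.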
Limitations (the paper's own)#
Single-turn only (no clarifying-question or multi-turn capability tested); a static snapshot despite an automatable refresh pipeline; text-only (no multimodal); English-only; augmentation risks over-specifying away natural query variability; rubric creation still needs heavy human-expert involvement; and absolute scores depend on the LLM judge (though rankings don't). The evaluation is also system-level (black-box), so there is no component-level attribution of retrieval vs. planning vs. synthesis.
Connections#
- Deep Research Agents — the system class DRACO evaluates; home of the orchestration / verification / efficiency findings
- Production-Sourced Evaluation — DRACO's central methodological contribution: tasks built from real de-identified production traffic
- LLM-as-a-Judge — the rubric-based binary-verdict grading protocol DRACO uses
- Task Time-Horizon Scaling — sibling capability benchmark; where METR measures task length a model sustains, DRACO measures research-report quality of agentic systems, and both note benchmark-saturation pressure (DRACO's saturation test sends tasks the leading system already scores >90% on back for hardening)
- Harness Shrinkage as Models Improve — DRACO's orchestration-beats-bare-model result is a counter-datapoint to the shrinking-harness thesis
- Verification as the New Bottleneck — factual-accuracy weakness across all systems is verification surfacing inside the research product
- Evals as Product Spec — DRACO is the externalized, large-scale form of "evals as the definition of done," with rubrics standing in for the eval set
- Perplexity / Anthropic / Google DeepMind — benchmark author; makers of evaluated systems and the judge model
Open questions#
- The benchmark is static; the construction pipeline is automatable. Will Perplexity actually refresh it, and does a vendor-built benchmark on which the vendor's own product wins stay credible over time?
- Rankings are judge-stable but magnitudes aren't — how much do absolute scores move under a non-Gemini judge, and does that matter for cross-paper comparison?
- Does the production-sourced, expert-rubric method generalize cheaply to non-English, multimodal, and multi-turn deep research?
Sources#
- DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity — full paper (arXiv:2602.11685, Perplexity + Harvard, Feb 2026): task construction, rubric pipeline, grading protocol, and all result tables
