Perplexity

Sources

DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity

Summary

Perplexity is an AI answer-engine / search company. In this corpus it appears as the author of the DRACO benchmark (arXiv:2602.11685, Feb 2026, with a Harvard co-author) and the maker of Perplexity Deep Research, the agentic deep-research system that tops every domain and rubric axis on that benchmark. It is the first vendor in the wiki whose contribution is a production-sourced evaluation built from its own deployment traffic.

What it does (in this corpus)

Perplexity Deep Research: a deep-research agent that decomposes a query, iteratively retrieves from many sources, and synthesizes a cited report. On DRACO it reaches a normalized score of 70.5% (Opus 4.6 base) and a pass rate of 72.8%, ahead of Gemini Deep Research, OpenAI Deep Research (o3 / o4-mini), and bare Claude Opus 4.5/4.6 with tools. Its profile is retrieval-heavy and output-lean: the highest score and lowest latency among deep-research systems, with the largest input-token footprint (~779k tokens per task).
DRACO: Perplexity sampled tens of millions of its own Deep Research queries (Sep–Oct 2025), then de-identified, augmented, filtered, and curated them into 100 tasks graded against expert rubrics (Production-Sourced Evaluation). The benchmark is publicly released on Hugging Face.
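The decompose, retrieve, and synthesize steps attributed to Perplexity Deep Research above can be illustrated with a toy loop. This is a minimal sketch of the general pattern only, not Perplexity's implementation: the in-memory `CORPUS`, the `example.com` URLs, the word-overlap retriever, and every function name (`decompose`, `retrieve`, `synthesize`, `deep_research`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    url: str
    text: str


@dataclass
class Report:
    body: str
    citations: list[str] = field(default_factory=list)


# Hypothetical in-memory corpus standing in for live web retrieval.
CORPUS = [
    Source("https://example.com/a", "solar capacity grew in 2024"),
    Source("https://example.com/b", "wind capacity grew in 2024"),
    Source("https://example.com/c", "battery storage costs fell"),
]


def decompose(query: str) -> list[str]:
    """Split a query into sub-questions (toy rule: split on ' and ')."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def retrieve(sub_query: str, seen: set[str]) -> list[Source]:
    """Return not-yet-seen sources sharing at least one word with the sub-query."""
    words = set(sub_query.lower().split())
    return [
        s for s in CORPUS
        if s.url not in seen and words & set(s.text.lower().split())
    ]


def synthesize(evidence: list[Source]) -> Report:
    """Build a report in which each statement carries a numbered citation."""
    lines = [f"{s.text} [{i}]" for i, s in enumerate(evidence, start=1)]
    return Report(body="\n".join(lines), citations=[s.url for s in evidence])


def deep_research(query: str, max_rounds: int = 3) -> Report:
    seen: set[str] = set()
    evidence: list[Source] = []
    sub_queries = decompose(query)
    for _ in range(max_rounds):
        new_hits = 0
        for sub in sub_queries:
            for source in retrieve(sub, seen):
                seen.add(source.url)
                evidence.append(source)
                new_hits += 1
        if new_hits == 0:  # nothing new this round, so stop retrieving
            break
    return synthesize(evidence)


report = deep_research("solar capacity and battery storage")
print(report.body)
print(report.citations)
```

In this sketch the sub-queries stay fixed across rounds, so the loop ends as soon as a round finds nothing new. A production agent would presumably also generate follow-up queries from the evidence gathered so far, which is one place a large input-token footprint could come from.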

The notable structural fact: customer and competitor

Perplexity Deep Research runs Claude Opus 4.5 / 4.6 as its base models (per the paper's experiment setting). So on DRACO, Perplexity's orchestrated product (Opus base) is benchmarked against the bare Anthropic Opus models, and it beats them by roughly 10 percentage points. Perplexity is therefore simultaneously an Anthropic API customer and the entity demonstrating that its orchestration layer adds substantial value on top of Anthropic's model. This is the cleanest concrete instance of the "orchestration beyond the base model" finding, and a live counter-datapoint to Harness Shrinkage as Models Improve.

Connections

DRACO Benchmark: the benchmark Perplexity authored and on which its product leads
Deep Research Agents: Perplexity Deep Research is the canonical leading instance of this system class
Production-Sourced Evaluation: DRACO's method, a benchmark built from Perplexity's own production traffic
Anthropic: Perplexity runs Claude Opus 4.5/4.6 as base models and benchmarks against bare Opus; customer and competitor at once
Google DeepMind: competitor (Gemini Deep Research is among the evaluated systems); Perplexity also chose its Gemini-3-Pro as DRACO's primary judge model
LLM-as-a-Judge: DRACO's grading method; Perplexity selected the judge via a human-alignment study

Open questions

A vendor publishing a benchmark its own product wins is an obvious incentive problem — how is DRACO's credibility maintained as it ages, and will Perplexity actually run the automatable refresh?
Perplexity depends on Anthropic (and others) for base models while competing with them on the end product — how durable is the orchestration advantage if base-model makers ship their own deep-research mode?

Sources

DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity — DRACO paper: Perplexity as author; Perplexity Deep Research as top-ranked system; Opus 4.5/4.6 as its base models; Gemini-3-Pro as judge