
Deep Research Agents

Published: June 15, 2026 · Filed: Concept · Domain: AI Engineering · Tags: Agent Engineering, Deep Research, Retrieval, Orchestration · Reading: 7 min · Source: AI-synthesised

Agentic systems that decompose a complex query, iteratively search diverse sources, and synthesize a structured, cited report, as distinct from single-shot QA. DRACO shows that orchestration (Perplexity Deep Research) beats the bare base model with tools, and that factual accuracy is the weak axis.



Summary

A deep research agent is an agentic AI system that takes a complex, open-ended query and: (1) decomposes it into constituent sub-workflows, (2) iteratively searches diverse external sources, and (3) synthesizes the gathered evidence into a structured, cited report. Unlike single-shot question answering, it interleaves multi-step planning and reasoning with autonomous retrieval and evaluation — verifying claims, resolving conflicting evidence, and identifying gaps in the literature. The output is an analysis whose breadth and depth would otherwise require extensive human-expert effort to produce. This is the system class that DRACO (Perplexity, Feb 2026) was built to evaluate, and the four production systems it benchmarks — Perplexity Deep Research, OpenAI Deep Research, Gemini Deep Research, and Claude Opus with web-search + code-execution tools — are the canonical instances.
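The three steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not how any of the benchmarked systems is implemented: `deep_research`, `Evidence`, `Report`, and the toy `toy_decompose` / `toy_search` stand-ins are hypothetical names introduced here, and a real system would back them with a planner model and live retrieval.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Evidence:
    claim: str   # a statement extracted from a source
    source: str  # identifier of the source backing the claim


@dataclass
class Report:
    query: str
    sections: dict[str, list[Evidence]] = field(default_factory=dict)
    gaps: list[str] = field(default_factory=list)  # sub-questions left unanswered


def deep_research(
    query: str,
    decompose: Callable[[str], list[str]],
    search: Callable[[str, int], list[Evidence]],
    max_rounds: int = 3,
) -> Report:
    """Decompose the query, search iteratively, then assemble a cited report."""
    report = Report(query=query)
    pending = decompose(query)                 # (1) split into sub-questions
    for attempt in range(max_rounds):          # (2) retry until covered or out of budget
        if not pending:
            break
        unanswered = []
        for sub in pending:
            found = search(sub, attempt)
            if found:
                report.sections.setdefault(sub, []).extend(found)
            else:
                unanswered.append(sub)
        pending = unanswered
    report.gaps = pending                      # (3) surface what stayed unsupported
    return report


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_decompose(query: str) -> list[str]:
    return [f"{query}: background", f"{query}: evidence", f"{query}: risks"]


def toy_search(sub: str, attempt: int) -> list[Evidence]:
    if sub.endswith("background"):
        return [Evidence("background claim", "source-a")]
    if sub.endswith("evidence") and attempt >= 1:  # only found on a retry
        return [Evidence("evidence claim", "source-b")]
    return []


report = deep_research("topic", toy_decompose, toy_search)
```

In this toy run the "evidence" sub-question is only answered on the second round and "risks" is never answered, so it ends up in `report.gaps` rather than being silently dropped. Keeping every claim paired with its source is what makes the final report citable.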

The pattern vs. single-shot QA

| Dimension | Single-shot QA | Deep research agent |
| --- | --- | --- |
| Planning | none / implicit | explicit query decomposition into sub-workflows |
| Retrieval | one pass (or none) | iterative, multi-source, autonomous |
| Reasoning | within one generation | multi-step; verify, resolve conflicts, find gaps |
| Output | an answer | a structured, cited report |
| Effort replaced | a lookup | hours of human-expert research |

Deep research is increasingly load-bearing in knowledge-intensive domains — academic research, medical decision support, legal analysis, financial analysis — where the bar is comprehensive, in-depth, transparent, and verifiable reasoning over large, heterogeneous corpora.

Orchestration beats the bare model (the DRACO finding)

The most consequential result for this wiki: on DRACO, Perplexity Deep Research (Opus 4.5 / 4.6 base) substantially outperforms bare Claude Opus 4.5 / 4.6 with web_search and code_execution tools, scoring 70.5% vs 59.8% (normalized) for the 4.6 pairing. The same base model, wrapped in a purpose-built retrieval-and-synthesis harness, gains 10.7 percentage points. The paper's own gloss: this indicates "the importance of agent orchestration beyond the base model."
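The size of that gap can be read in more than one way. The short calculation below uses only the two normalized scores quoted above; the variable names are illustrative, and the "share of headroom" framing is an added lens rather than a figure from the DRACO paper.

```python
orchestrated = 70.5  # Perplexity Deep Research, Opus 4.6 base (normalized %)
bare = 59.8          # Claude Opus 4.6 with web_search + code_execution (normalized %)

gap_pp = orchestrated - bare                      # absolute gap in percentage points
relative_gain = gap_pp / bare * 100               # gain relative to the bare model's score
headroom_closed = gap_pp / (100.0 - bare) * 100   # share of the bare model's remaining headroom

print(f"absolute gap:    {gap_pp:.1f} pp")
print(f"relative gain:   {relative_gain:.1f} %")
print(f"headroom closed: {headroom_closed:.1f} %")
```

Read this way, the harness adds roughly 18% on top of the bare model's score and closes about a quarter of the distance between the bare model and a perfect normalized score.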

This is a live counter-datapoint to Harness Shrinkage as Models Improve. The wiki's recurring thesis is that scaffolding shrinks as models improve and mechanical verification is what stays load-bearing. Deep research is a domain where, as of early 2026, the harness still carries a large, measurable share of system quality — the orchestration layer (query decomposition, iterative retrieval strategy, source selection, synthesis discipline) has not dissolved into the base model. Whether it shrinks as models cross the next capability thresholds is the open question; DRACO is the current measurement.

Verification is the binding constraint

Across every system DRACO grades, the ranking by rubric axis is consistent: strongest on presentation quality, weakest on factual accuracy and citation quality. Fluency is solved; verifiable correctness is not. This is Verification as the New Bottleneck surfacing inside the research product itself — the hard part is no longer producing a readable report but ensuring every claim in it is true and properly sourced. It is the open-domain mirror of AI-Driven Formal Proof Search, where a compiler makes verification total; deep research has no such oracle, so accuracy/citation become the frontier.
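One way to see how a per-axis ranking like this can fall out of rubric grading is sketched below. It assumes each rubric criterion carries an axis label and a binary MET/UNMET verdict from the judge. The function name `axis_pass_rates`, the axis labels, and the toy verdicts are all invented for illustration; they are not DRACO data and do not reproduce DRACO's actual normalization.

```python
from collections import defaultdict


def axis_pass_rates(verdicts: list[tuple[str, bool]]) -> dict[str, float]:
    """Fraction of criteria judged MET, grouped by rubric axis."""
    met: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for axis, is_met in verdicts:
        total[axis] += 1
        met[axis] += int(is_met)
    return {axis: met[axis] / total[axis] for axis in total}


# Hypothetical judge verdicts for a single report: (axis, MET?).
toy_verdicts = [
    ("presentation", True), ("presentation", True),
    ("presentation", True), ("presentation", False),
    ("factual_accuracy", True), ("factual_accuracy", False),
    ("factual_accuracy", False), ("factual_accuracy", True),
    ("citation", True), ("citation", False),
    ("citation", False), ("citation", False),
]

rates = axis_pass_rates(toy_verdicts)
weakest_axis = min(rates, key=rates.get)
strongest_axis = max(rates, key=rates.get)
```

Reporting pass rates per axis, rather than a single blended score, is what lets a fluent but poorly sourced report show up as strong on presentation and weak on citation.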

Efficiency: more tokens ≠ better

DRACO's token/latency table breaks the intuition that longer, more expensive runs win:

  • The top scorer (Perplexity, Opus 4.6) also had the lowest latency among deep-research systems (245s), despite the largest input-token footprint (~779k tokens/task) — input-heavy retrieval, lean output (~8.8k tokens).
  • OpenAI o3 and Gemini produced the most output (24.9k and 22.1k tokens, respectively) yet scored mid-pack; verbosity did not buy quality.
  • OpenAI o4-mini was the most token-efficient overall (~53.5k total) but lagged on score (41.9%).

The shape — quality decoupled from output length, input-token spend doing the real work — is the deep-research instance of the cost/quality combo tradeoffs formalized in Client-Side Agent Optimization (model-per-role, budget, routing): the lever that matters is orchestration design, not raw token expenditure.
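The figures quoted above are enough for a rough back-of-the-envelope comparison. The sketch below uses only those approximate numbers and makes one assumption that is not in the source: that the Perplexity run's total token count is simply input plus output.

```python
# Approximate per-task figures quoted in the text (thousands of tokens, normalized % score).
perplexity = {"score": 70.5, "input_k": 779.0, "output_k": 8.8}
o4_mini = {"score": 41.9, "total_k": 53.5}

# Assumption: total = input + output for the Perplexity run.
perplexity_total_k = perplexity["input_k"] + perplexity["output_k"]

input_to_output = perplexity["input_k"] / perplexity["output_k"]  # how input-heavy the run is
token_multiple = perplexity_total_k / o4_mini["total_k"]          # total spend vs o4-mini
score_gap_pp = perplexity["score"] - o4_mini["score"]             # quality difference

print(f"input:output ratio      ~{input_to_output:.1f}:1")
print(f"token spend vs o4-mini  ~{token_multiple:.1f}x")
print(f"score gap               {score_gap_pp:.1f} pp")
```

The top scorer reads roughly 88 tokens for every token it writes. The extra spend relative to o4-mini goes almost entirely into retrieved input, not into a longer report.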

Where it sits

Deep research is a long-horizon, autonomous, multi-step task, exactly the regime Task Time-Horizon Scaling measures, run as a continuous retrieval-and-synthesis harness. It is one of the clearest current cases where a product (the harness + orchestration) is worth substantially more than the model it wraps. That is why a benchmark of systems (not models) like DRACO is the right instrument: its tasks are sourced from real production usage, and outputs are graded by LLM-as-a-judge against expert rubrics.

Connections

  • DRACO Benchmark — the benchmark built to evaluate this system class; source of the orchestration, verification, and efficiency findings here
  • Agent Harness Engineering — deep research is a retrieval-and-synthesis harness; the "orchestration beyond the base model" result is direct evidence that this harness layer is load-bearing
  • Harness Shrinkage as Models Improve — counter-datapoint: here the harness has not shrunk into the model (~10pp gap between orchestrated and bare-model-with-tools)
  • Verification as the New Bottleneck — factual accuracy / citation are the weakest axes across all systems; verifiable correctness is the frontier
  • Task Time-Horizon Scaling — deep research is a long-horizon autonomous task of exactly the kind METR's time-horizon metric measures
  • Client-Side Agent Optimization — the token/latency tradeoffs (more output ≠ better; orchestration > raw spend) are the deep-research instance of combo/budget optimization
  • LLM-as-a-Judge — how DRACO grades deep-research outputs against task-specific rubrics
  • Production-Sourced Evaluation — how DRACO's tasks were sourced (from real Perplexity Deep Research traffic) so the benchmark reflects actual use
  • AI-Driven Formal Proof Search — the verification-total contrast: a sound verifier eliminates the accuracy gap that deep research can't close
  • Perplexity — builder of the leading deep-research system and of DRACO
  • Anthropic / Google DeepMind — makers of evaluated systems (Claude Opus; Gemini Deep Research, and Gemini-3-Pro as judge)
  • Repository Exploration Subagent — structurally parallel inside a coding agent: FastContext decomposes a task into exploration + solving and isolates iterative search behind a compact synthesized return — the same decompose→search→synthesize shape deep research applies to the web

Open questions

  • Does the orchestration advantage shrink as base models cross the next thresholds, or is open-ended retrieval/synthesis a durable harness asset (unlike, say, prompt scaffolding)?
  • DRACO grades single-turn interactions only. How much of real deep-research value is in the multi-turn loop (clarifying questions, follow-ups) that the benchmark doesn't yet measure?
  • Factual accuracy is the weak axis everywhere — is the fix better retrieval, better verification-in-the-loop, or a tool-grounded check the way Lean grounds proof search?

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
