Summary#
A framing introduced by Hua et al. (AgentOpt, 2026) that separates client-side optimization of agentic workflows — decisions under the developer's control such as which model to assign to each pipeline role, API budget allocation, and tool routing — from the server-side techniques (caching, scheduling, speculative execution, load balancing) that have dominated systems research on LLM serving. The central empirical claim is that model selection, evaluated at the level of full pipeline combinations rather than per-role in isolation, is the dominant efficiency lever: cost gaps between best and worst combinations at matched accuracy range from 13× to 32× across benchmarks, dwarfing what server-side optimizations can recover.
Details#
Server-Side vs. Client-Side#
Server-side systems (vLLM, SGLang, Autellix, ThunderAgent, Continuum, AIOS) optimize provider infrastructure across many users with objectives like throughput, tail latency, and cluster utilization. These objectives are generic because the provider cannot see the developer's specific utility function. Client-side optimization operates at the level of a specific workflow with an application-specific utility over quality, cost, and latency — a startup's coding assistant and a clinical-support system have incompatible preferences that can't be inferred from system-level signals.
The resources under client control:
- Foundation model pool — available API and local models
- Model-to-role assignment across planners, solvers, critics, retrievers
- Tool invocation policy — local vs. remote, when to skip
- API budget per step
- Application-level batching, caching, scheduling
Why Model Selection Is First-Class#
Model selection is upstream of every other client-side optimization: caching, routing heuristics, and speculative execution all operate conditional on a model assignment. Pick the wrong combination and no downstream optimization can close the gap.
The empirical evidence is striking. On BFCL, Qwen3 Next 80B matches Claude Opus 4.6 in accuracy at 32× lower cost; on MathQA, cost gaps of 24× separate comparably accurate combinations.
The Combo Abstraction#
The paper's key conceptual contribution. In conventional LLM routing, each query is assigned to a cheaper or stronger model based on estimated difficulty — decisions are per-call. In multi-step agents, routing decisions are coupled across stages: a model's behavior in one role changes the intermediate state that later roles see. A planner that delegates to a tool creates different downstream work than a planner that answers from parametric knowledge.
Consequence: the unit of optimization is the full combination $\mathbf{c} = (m_1, \dots, m_H) \in \mathcal{M}^H$, not the per-role best. Performance rankings do not transfer cleanly across roles — a strong standalone model can be an excellent solver but a poor planner.
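A quick sense of scale, assuming the 81-combination HotpotQA search below arises from 9 candidate models over 2 roles (that factorization is an assumption for illustration; model names are placeholders):

```python
from itertools import product

# Illustrative sizing of the combo space M^H. Assumes |M| = 9 models
# and H = 2 roles, which yields the 81 combinations cited below.
models = [f"model-{i}" for i in range(9)]   # candidate pool M
combos = list(product(models, repeat=2))    # (planner, solver) assignments
assert len(combos) == 81                    # |M|^H = 9^2
```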
The canonical illustration from the paper, HotpotQA:
- Claude Opus 4.6 is the worst planner across 81 combinations — when used as planner it often answers directly from parametric knowledge and bypasses the solver's search tools.
- Ministral 3 8B is the best planner because it reliably delegates to the downstream solver.
- Ministral (planner) + Opus (solver) → 74.27%; Opus (planner) + Opus (solver) → 31.71%.
This is the same overthinking / overelaboration phenomenon described in Scale-Dependent Prompt Sensitivity, surfaced as a routing failure rather than a prompt-engineering failure.
Formulation as Black-Box Optimization#
Given $H$ pipeline roles and a candidate model set $\mathcal{M}$, the combination space is $\mathcal{M}^H$, of size $|\mathcal{M}|^H$ — exponential in pipeline depth. The utility function
$$J(\mathbf{c}) = \mathrm{PERF}(\tau(\mathbf{c})) - \lambda_c\,\mathrm{COST}(\tau(\mathbf{c})) - \lambda_\ell\,\mathrm{LATENCY}(\tau(\mathbf{c}))$$
is treated as an unknown black-box because cross-stage interactions are task-dependent and not analytically tractable.
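A minimal sketch of the scalarization, assuming illustrative $\lambda$ weights (the function name and constants are not AgentOpt's API; perf, cost, and latency would be measured from the trace $\tau(\mathbf{c})$):

```python
# Sketch of the scalarized utility J(c). The lambda weights are
# illustrative assumptions; inputs come from executing the trace.
LAMBDA_COST = 0.5        # utility lost per dollar of API spend
LAMBDA_LATENCY = 0.01    # utility lost per second of wall-clock latency

def utility(perf: float, cost_usd: float, latency_s: float) -> float:
    return perf - LAMBDA_COST * cost_usd - LAMBDA_LATENCY * latency_s
```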
Search Algorithms#
AgentOpt implements eight selectors sharing the same execution substrate:
- Arm Elimination (best-performing) — multi-armed bandit that prunes dominated combinations; recovers near-optimal accuracy with 24–67% less evaluation budget than brute force on 3 of 4 benchmarks (sketched after this list)
- Epsilon-LUCB — confidence-bound bandit
- Threshold Successive Elimination
- Bayesian Optimization
- Plus hill climbing, random search, and brute-force baselines
All selectors share the same API so strategies can be swapped without touching agent code.
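A minimal sketch of the arm-elimination idea referenced above; the round schedule, keep fraction, and the evaluate(combo) → utility-sample signature are illustrative assumptions, not the paper's exact algorithm:

```python
import statistics

def arm_elimination(combos, evaluate, rounds=5, samples_per_round=4, keep_frac=0.5):
    """Each round, sample utility for surviving combos, then prune the dominated tail."""
    alive = list(combos)
    scores = {c: [] for c in alive}
    for _ in range(rounds):
        for c in alive:
            scores[c].extend(evaluate(c) for _ in range(samples_per_round))
        # rank survivors by mean observed utility, keep the top fraction
        alive.sort(key=lambda c: statistics.mean(scores[c]), reverse=True)
        alive = alive[: max(1, round(len(alive) * keep_frac))]
        if len(alive) == 1:
            break
    return alive[0]
```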
Framework-Agnostic Interception#
The systems mechanism: patch httpx.Client.send and httpx.AsyncClient.send at the HTTP transport layer, attributing each call to its (datapoint, combination) pair via Python contextvars. This avoids per-framework SDK adapters — it works across LangGraph, AutoGen, OpenClaw, Claude Code, and any agent that uses httpx under the hood.
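A minimal sketch of the interception pattern, assuming a hypothetical record_usage recorder (AgentOpt's actual hooks are richer):

```python
import contextvars
import httpx

# The context variable carries (datapoint, combo), so concurrent evaluations
# attribute their HTTP calls correctly even across async task switches.
ATTRIBUTION = contextvars.ContextVar("attribution", default=None)

def record_usage(datapoint, combo, response):
    """Hypothetical recorder: parse token usage / cost from the response and log it."""
    print(f"{datapoint=} {combo=} status={response.status_code}")

_original_send = httpx.Client.send

def _traced_send(self, request, **kwargs):
    response = _original_send(self, request, **kwargs)
    tag = ATTRIBUTION.get()
    if tag is not None:
        record_usage(*tag, response)
    return response

httpx.Client.send = _traced_send  # httpx.AsyncClient.send is patched analogously
```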
The runtime also handles response caching (re-runs of the same (combo, datapoint) pair don't re-spend the API budget) and parallel execution (e.g., max_concurrent=20).
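A sketch of how caching and bounded concurrency might compose, with the evaluate signature and the (combo, datapoint) cache key as assumptions:

```python
import asyncio

_cache: dict[tuple, object] = {}
_semaphore = asyncio.Semaphore(20)   # cf. max_concurrent=20

async def evaluate_cached(combo, datapoint, evaluate):
    key = (combo, datapoint)
    if key in _cache:                # same (combo, datapoint): no budget re-spent
        return _cache[key]
    async with _semaphore:           # bounded parallel execution
        result = await evaluate(combo, datapoint)
    _cache[key] = result
    return result
```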
Output: a SelectionResults object exposing the Pareto frontier over (performance, cost, latency), with CSV export and YAML configuration export for deployment.
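The Pareto filter itself is standard dominance checking over the three axes; a sketch with assumed field names (not SelectionResults' actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComboResult:
    combo: tuple
    perf: float      # higher is better
    cost: float      # lower is better
    latency: float   # lower is better

def dominates(a: ComboResult, b: ComboResult) -> bool:
    # a dominates b if it is no worse on every axis and strictly better on one
    no_worse = a.perf >= b.perf and a.cost <= b.cost and a.latency <= b.latency
    strictly = a.perf > b.perf or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly

def pareto_frontier(results: list[ComboResult]) -> list[ComboResult]:
    return [r for r in results if not any(dominates(o, r) for o in results)]
```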
Separation of Policy and Execution#
Selectors (what to evaluate next) are separate from the runtime (how to execute, track, attribute, cache). This separation is what lets the eight algorithms share benchmarks — the search is the only variable.
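The separation can be captured as a small policy interface; this Protocol illustrates the shape, not AgentOpt's actual class names:

```python
from typing import Protocol

class Selector(Protocol):
    """Policy side: decides what to evaluate next. Hypothetical interface."""
    def propose(self) -> tuple | None: ...                        # next combo, or None when done
    def observe(self, combo: tuple, utility: float) -> None: ...  # feed results back

def run(selector: Selector, runtime_evaluate) -> None:
    # Runtime side owns execution, attribution, and caching; the policy never
    # touches it, which is what lets all the selectors share one benchmark harness.
    while (combo := selector.propose()) is not None:
        selector.observe(combo, runtime_evaluate(combo))
```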
Manual Levers in the Wild#
The client-side levers AgentOpt formalizes (model assignment, budget, caching, batching) appear as user-facing CLI commands in production agent tools. Hermes Agent is the most explicit:
| Hermes lever | AgentOpt analog |
|---|---|
| /model (mid-session model switch) | per-role model assignment in the combo space |
| /compress (summarize conversation) | application-level caching / context-budget management |
| /usage, /insights | observability over the same cost/latency/perf signals AgentOpt uses for utility |
| delegate_task (parallel subagents with isolated contexts) | sub-pipeline assignment with independent combos |
| Bounded MEMORY.md (~2,200 chars), USER.md (~1,375 chars) | explicit budget envelope on persistent context |
| Prompt-cache discipline (avoid mid-session model/system-prompt changes) | the cache-stability constraint that makes per-session combo selection stable |
Significance: the levers exist in production tools today and are exercised manually by users. AgentOpt's contribution is automating selection over the same lever space rather than introducing new levers. A practical bridge would be an AgentOpt selector that drives Hermes's /model switches per-role given a benchmark, then writes the resulting combo into AGENTS.md for deployment.
The Hermes documentation also captures a constraint AgentOpt's combo abstraction implicitly relies on: don't break the prompt cache mid-session. Cache hits make per-message cost roughly constant; mid-session model/system-prompt changes invalidate that. If combo selection changes per-call rather than per-session, expected savings can be wiped out by cache misses — a deployment hazard worth surfacing when promoting AgentOpt's findings to production.
Connections#
- Scale-Dependent Prompt Sensitivity — AgentOpt's HotpotQA finding (Opus is the worst planner because it bypasses the solver) is the same overthinking / over-elaboration mechanism Hakim documents at the prompt level. One paper surfaces it as a routing failure, the other as a prompt-engineering failure; together they imply that large-model misuse is a systematic failure mode with two available mitigations (route around it, or constrain output)
- Agent Harness Engineering — client-side optimization is a layer above harness design: once the environment, progress logs, and verification loops are in place, combo selection chooses which models operate inside that harness. The JSON feature-list and progressive-disclosure patterns are execution substrate for the agents AgentOpt assigns
- Claude Code Best Practices — directly challenges the implicit "use the strongest model" default. AgentOpt's framework-agnostic httpx interception is also compatible with Claude Code's claude -p non-interactive mode, suggesting Claude Code pipelines can be subject to combo optimization
- LLM-Driven Vulnerability Research — the file-ranking 1–5 pre-pass and the final validation agent are hand-tuned instances of exactly what AgentOpt searches over automatically. Treating the vuln-research scaffold as an AgentOpt pipeline (planner = file-ranker, solver = bug-finder, critic = validator) is a direct generalization
- LLM-as-Compiler Knowledge Base — the wiki's own compile / query / lint phases could be modeled as an agent pipeline where different phases run on different models (e.g., cheap model for index drift checks, strong model for cross-reference synthesis)
- Claude Opus 4.7 — the HotpotQA planner failure was measured on Opus 4.6; 4.7's literal instruction following may partially close that gap (needs re-measurement). Task budgets (public beta) echo AgentOpt's budget lever, but server-side rather than client-side
- Hermes Agent — production CLI agent that exposes the AgentOpt lever space (/model, /compress, delegate_task, bounded memory, prompt-cache discipline) as user-facing commands; a natural integration target for AgentOpt selectors driving role assignment automatically
- Symphony — at scale, ticket-driven orchestration makes per-pipeline combo selection operationally important: choosing the right model per ticket type (planner vs. solver vs. reviewer) inside WORKFLOW.md's prompt template is a per-pipeline budget decision
Derived#
- When to Use Claude Opus 4.6 for Work — deployment rules drawn from the HotpotQA planner/solver results and the BFCL 32× cost-match finding
- Opus 4.6 → 4.7 Changes and Multi-Agent Coding Considerations — role-based model selection principles applied to an Opus 4.7 multi-agent coding team
Open Questions#
- How does combination-level optimization interact with continual model releases? When a new frontier model ships (e.g., the Opus 4.6 → 4.7 transition), does the full Pareto frontier need re-running, or do warm-started bandits adapt cheaply?
- At what pipeline depth does the combinatorial search become intractable even for Arm Elimination? The paper tests up to ~81 combinations; production pipelines with 5+ roles and 10+ candidate models each blow past that.
- Does the "weak planner + strong solver" pattern generalize, or is it specific to HotpotQA's delegation dynamic? Recommender-critic, drafter-editor, and retriever-generator topologies might invert.
- What's the right way to re-evaluate when the tool environment changes? AgentOpt assumes fixed tools — adding or removing a tool potentially invalidates the whole frontier.
- Is there a cheap per-call classifier that can predict which combination will win on a given query, avoiding combo-level evaluation entirely?
Sources#
13 articles link here
- Concept · Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
- Concept · Claude Code Best Practices
Anthropic's guide to effective Claude Code usage: context management, verification-driven development, explore→plan→cod…
- Entity · Claude Opus 4.7
GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…
- Concept · Codex App Server Protocol
JSON-RPC stdio protocol for headless Codex sessions: initialize/initialized/thread-start/turn-start handshake, continua…
- Entity · Hermes Agent
Nous Research's CLI agent + Gateway daemon (Telegram/Discord/Slack/WhatsApp); AGENTS.md/SOUL.md context split, bounded…
- Concept · Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
- Concept · LLM-as-Compiler Knowledge Base
Karpathy's architecture: LLM incrementally compiles raw docs into a persistent interlinked wiki, replacing RAG with a 4…
- Concept · LLM-Driven Vulnerability Research
Claude Mythos Preview's emergent cybersecurity capabilities: autonomous zero-day discovery, full exploit chains, and An…
- Essay · Opus 4.6 → 4.7 Changes and Multi-Agent Coding Considerations
4.6→4.7 delta table + six hazards for multi-agent coding teams: role-based model selection, prompt re-tuning, harness i…
- Concept · Scale-Dependent Prompt Sensitivity
Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…
- Entity · Symphony
OpenAI's open-source agent orchestrator (March 2026): turns Linear into a control plane for Codex, per-issue workspace,…
- Concept · Ticket-Driven Agent Orchestration
The inversion that makes Symphony work: tickets as units of work (not sessions/PRs), DAG dependencies, agent-extensible…
- Essay · When to Use Claude Opus 4.6 for Work
Decision rules for Opus 4.6 deployment: solver-not-planner, elaboration-load-bearing tasks, brevity constraints, Pareto…
