Sources#
- AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
- Auto mode for Claude Code
- Best Practices for Claude Code
- Brevity Constraints Reverse Performance Hierarchies in Language Models
- Introducing Claude Opus 4.7
Question#
What's the difference between Opus 4.6 & 4.7, and what should I pay attention to when working in a multi-team agent setup for coding?
Part 1 — What Changed Between 4.6 and 4.7#
Same price ($5/M input, $25/M output), same product position, new API ID (claude-opus-4-7). Direct upgrade. Details and nuances from Claude Opus 4.7:
| Axis | 4.7 vs. 4.6 | Why it matters for a multi-agent setup |
|---|---|---|
| Hardest coding tasks | Marketed for "hand off your hardest work"; SOTA Finance Agent and GDPval-AA | Raises the per-role ceiling where Opus is the right choice |
| Instruction following | Literal. Skips/loosens less. | Prompts and CLAUDE.md hedges written for 4.6 may misbehave — biggest migration hazard |
| Vision | Up to 2,576 px long edge (~3.75 MP, >3× prior) | Computer-use agents reading dense screenshots improve |
| File-system memory | Stronger across multi-session work | Favors repo-local versioned artifacts as shared memory between agents |
| Safety | Better on prompt-injection and honesty; slightly weaker on harm-reduction over-elaboration | Prompt-injection resistance specifically matters when one agent consumes another's output |
| Cyber capability | Differentially reduced during training + request-level classifier (post-Glasswing) | Legitimate security automation now routes through Cyber Verification Program |
| Effort levels | New xhigh between high and max. Claude Code default raised to xhigh on all plans | Per-agent token cost at default increased |
| Tokenizer | Same input → 1.0–1.35× more tokens | Context budget shrinks per agent unless prompts are re-measured |
| Per-turn thinking | Thinks more at higher effort, especially on later agentic turns | Compounds tokenizer inflation across multi-turn orchestration |
| Accompanying launches | Task budgets (API public beta), /ultrareview, auto mode extended to Max | Server-side budget lever arrives alongside auto mode for unattended runs |
Compounded token-budget hit: tokenizer inflation × xhigh default × more per-turn thinking. The naive "drop-in upgrade" consumes more context than 4.6 on the same prompts. Anthropic explicitly recommends measuring on real traffic rather than trusting the net-favorable claim on their internal coding eval.
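The compounding above is simple arithmetic, and worth making explicit. A minimal sketch, where the 1.0–1.35× inflation range is from the report but the thinking-overhead figure and the 200k/40k token numbers are illustrative assumptions, not measurements:

```python
# Illustrative arithmetic only: estimate the working-context budget left after
# migrating the same prompt stack from 4.6 to 4.7. The tokenizer inflation
# range (1.0-1.35x) is from the report; the thinking overhead is a placeholder.

def effective_budget(context_window: int, prompt_tokens_46: int,
                     tokenizer_inflation: float, thinking_overhead: float) -> int:
    """Tokens left for working context after prompts inflate and thinking grows."""
    prompt_tokens_47 = prompt_tokens_46 * tokenizer_inflation
    return int(context_window - prompt_tokens_47 * (1 + thinking_overhead))

# Hypothetical 200k window, 40k-token prompt stack: worst-case inflation (1.35x)
# plus an assumed 20% extra per-turn thinking shrinks the remainder noticeably.
on_46 = effective_budget(200_000, 40_000, 1.0, 0.0)    # 160_000
on_47 = effective_budget(200_000, 40_000, 1.35, 0.2)   # 135_200
```

This is exactly why Anthropic's "measure on real traffic" advice matters: the inflation factor varies per prompt, so the only trustworthy numbers are your own.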
Part 2 — What to Pay Attention to in a Multi-Agent Coding Setup#
The research in this wiki points at six concrete hazards when moving from a single-agent workflow to a multi-agent team on Opus 4.7.
1. Role-Based Model Selection, Not Strongest-Model Default#
From Client-Side Agent Optimization (AgentOpt, Hua et al., 2026) measured on Opus 4.6:
| Combination (planner + solver on HotpotQA) | Accuracy |
|---|---|
| Ministral 3 8B + Opus | 74.27% |
| Opus + Opus | 31.71% |
Opus-as-planner bypasses the downstream solver's search tools and answers from parametric knowledge. The combo is the right unit of analysis, not per-role accuracy in isolation.
Rule:
- Put a smaller, more obedient model in planner / router / file-ranker / decomposer seats.
- Reserve Opus 4.7 for synthesis, integrative reasoning, final-answer generation, and code-review-over-multiple-files roles.
- Before committing Opus to a role, check whether a cheaper model matches accuracy. On BFCL, Qwen3 Next 80B matched Opus 4.6 at 32× lower cost. 4.7's tokenizer inflation makes this gap worse, not better.
Open question on 4.7: literal instruction following may narrow the planner gap. Don't assume it — re-measure. See When to Use Claude Opus 4.6 for Work addendum for the per-rule 4.7 predictions.
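In harness code, role-based selection is just an explicit role-to-model map that fails loudly instead of defaulting to the strongest model. A sketch, assuming hypothetical model identifiers drawn from the examples above:

```python
# Sketch of role-based model selection. Model IDs are illustrative stand-ins
# for whatever your provider exposes; the point is the explicit mapping.
ROLE_MODELS = {
    "planner": "ministral-3-8b",      # small, obedient: plans without answering
    "router": "ministral-3-8b",
    "file_ranker": "qwen3-next-80b",
    "solver": "claude-opus-4-7",      # reserve Opus for synthesis / final answers
    "reviewer": "claude-opus-4-7",
}

def model_for(role: str) -> str:
    # Fail loudly on unknown roles rather than silently falling back to the
    # strongest (most expensive) model.
    if role not in ROLE_MODELS:
        raise KeyError(f"no model assigned for role {role!r}")
    return ROLE_MODELS[role]
```

The `KeyError` is the design choice: an unmapped role is a configuration bug to surface, not a gap to paper over with a "strongest model everywhere" default.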
2. Re-Tune Prompts From the 4.6 Era#
Opus 4.7's literal instruction following is the single biggest migration hazard for existing multi-agent orchestration code. Prompts that worked on 4.6 because it loosely interpreted "or similar," "prefer X," "try to," or skipped apparently-optional steps may now behave rigidly.
Audit list:
- CLAUDE.md files and system prompts containing hedges
- Multi-agent role cards that relied on the model "knowing when to delegate"
- Chained prompts where a later step assumed an earlier step would be skipped
- Prompts that assumed Opus would correct ambiguous instructions rather than execute them as written
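The first audit item is mechanical enough to script. A minimal sketch that greps CLAUDE.md files for hedge phrases; the phrase list is illustrative, not exhaustive, and the helper names are hypothetical:

```python
import re
from pathlib import Path

# Hypothetical audit helper: flag 4.6-era hedge phrases that a literal
# instruction-follower will now execute rigidly. Extend the list for your repo.
HEDGES = re.compile(
    r"\b(or similar|prefer(?:ably)?|try to|if possible|when appropriate)\b",
    re.IGNORECASE,
)

def audit_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs containing hedge phrases."""
    return [(i, line) for i, line in enumerate(text.splitlines(), 1)
            if HEDGES.search(line)]

def audit_repo(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every CLAUDE.md under root; map path -> flagged lines."""
    hits: dict[str, list[tuple[int, str]]] = {}
    for path in Path(root).rglob("CLAUDE.md"):
        found = audit_text(path.read_text())
        if found:
            hits[str(path)] = found
    return hits
```

A flagged line is not automatically wrong; it is a line to re-read while asking "what happens if the model does exactly this?"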
3. Harness-Level Invariants, Not Prompt Advisories#
From Agent Harness Engineering and Scale-Dependent Prompt Sensitivity:
- Enforce output constraints mechanically — structured-output schemas, length caps, response validators, `/ultrareview`-style reviewer passes.
- Brevity constraints recovered +26.3pp on overthinking-prone problems on 4.6. 4.7's literal instruction following may make these more effective (the model obeys word caps). Use them.
- Diagnostic exception: BoolQ and similar cross-sentence integration tasks — brevity hurts. Don't cap reasoning-product outputs.
Large-model verbosity stacks in a multi-agent pipeline: each agent's output becomes another agent's context. Errors compound, context fills, reasoning degrades.
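A mechanical invariant is code, not prose in a prompt. A minimal sketch of a handoff validator, assuming a hypothetical JSON handoff schema; the key names and the 200-word cap are illustrative choices, not recommendations:

```python
import json

# Hypothetical harness-level invariant: a shape check plus length cap applied
# to one agent's output before it enters another agent's context.

def validate_handoff(output: str, max_words: int = 200,
                     required_keys: tuple[str, ...] = ("summary", "files_touched")) -> str:
    payload = json.loads(output)  # invariant 1: output must parse as JSON
    missing = [k for k in required_keys if k not in payload]
    if missing:                   # invariant 2: schema keys must be present
        raise ValueError(f"handoff missing keys: {missing}")
    words = len(payload["summary"].split())
    if words > max_words:         # invariant 3: brevity cap on the summary
        raise ValueError(f"summary is {words} words, cap is {max_words}")
    return payload["summary"]
```

The rejection is the point: a handoff that fails validation never reaches the next agent's context, so verbosity cannot stack silently. Per the BoolQ exception above, apply the word cap to handoff summaries, not to reasoning-product outputs.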
4. Context Budget Management Is Per-Agent#
Claude Code Best Practices applies to each agent in isolation. With 4.7 default xhigh + tokenizer inflation + more per-turn thinking, multi-agent handoffs consume budget faster. Tactics:
- Summarized handoffs, not full-context passthrough
- Subagents in isolated context (Claude Code pattern): investigate separately, return a summary
- Writer/Reviewer pattern with fresh context on the reviewer side — especially relevant now that `/ultrareview` is a dedicated 4.7 primitive
- Lower effort for non-hard steps. Anthropic recommends `high` or `xhigh` for coding/agentic; `max` is rarely worth it
- Task budgets (API public beta): per-phase spend caps as a server-side analogue to AgentOpt's budget lever
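The summarized-handoff tactic can be approximated deterministically before you invest in model-based summarization. A toy sketch, assuming a character budget stands in for a token budget; in practice you would summarize older turns with a cheap model rather than drop them:

```python
# Toy sketch of a budgeted handoff: keep only the most recent turns that fit,
# rather than passing the full transcript into the next agent's context.
# A character budget stands in for a real token count here.

def handoff(transcript: list[str], budget_chars: int = 500) -> str:
    kept: list[str] = []
    used = 0
    for turn in reversed(transcript):   # newest turns are most relevant
        if used + len(turn) > budget_chars:
            break                        # older turns fall outside the budget
        kept.append(turn)
        used += len(turn)
    return "\n".join(reversed(kept))     # restore chronological order
```

Even this crude version enforces the key property: the downstream agent's context cost is bounded by the orchestrator, not by how chatty the upstream agent was.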
5. Unattended Fan-Out Safety#
From Claude Code Auto Mode (now extended to Max users alongside 4.7):
- Auto mode is strictly safer than `--dangerously-skip-permissions` for multi-agent fan-out — a classifier pre-flights every tool call and redirects Claude on risky actions.
- It is not a substitute for isolated environments. Anthropic documents two classifier failure modes: ambiguous intent and missing environment context.
- In non-interactive mode, auto mode aborts on repeated blocks rather than hanging on an un-answerable prompt — preserving the fan-out use case.
When an agent fans out commands with destructive shape (migrations, deletes, deploys) across a team: layer auto mode inside a sandboxed container or worktree. Defense in depth.
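Defense in depth means the harness keeps its own gate even with auto mode's classifier in front. A minimal sketch; the destructive-pattern list is illustrative and should be tuned to your team's actual tooling:

```python
import re

# Hypothetical harness-side gate: refuse destructive-shaped commands unless
# the agent is running inside a sandboxed container or worktree. This sits
# behind auto mode's classifier, not instead of it.
DESTRUCTIVE = re.compile(
    r"\b(drop\s+table|rm\s+-rf|terraform\s+destroy|deploy)\b",
    re.IGNORECASE,
)

def allow_command(cmd: str, sandboxed: bool) -> bool:
    """Permit destructive-shaped commands only in an isolated environment."""
    if DESTRUCTIVE.search(cmd) and not sandboxed:
        return False
    return True
```

A keyword gate is deliberately dumb: it catches the commands whose *shape* is destructive regardless of intent, which is exactly the case where the classifier's "ambiguous intent" failure mode bites.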
6. Agent-Review Pattern Beats Self-Verification#
From Claude Code Best Practices: the Writer/Reviewer pattern — one agent implements, a second reviews with fresh context — reduces the "own-code blind spot." With Opus 4.7:
- `/ultrareview` is the vendored form of this pattern; Pro and Max users get 3 free ultrareviews for evaluation
- File system memory improvements mean the reviewer can read the writer's progress log and git history efficiently, not just the diff
- For unattended multi-team workflows: route reviewer-agent output back through a validator agent (mirrors the final-validation-agent pattern in LLM-Driven Vulnerability Research) rather than trusting single-agent reviews
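The three-stage chain above reduces to a small orchestration function once each stage is a callable with its own fresh context. A sketch; the stage signatures are an assumption about your harness, and real stages would be model calls rather than pure functions:

```python
from typing import Callable

# Sketch of the Writer -> Reviewer -> Validator chain. Each Stage is assumed
# to be a callable invoked with fresh context (no shared transcript).
Stage = Callable[[str], str]

def review_pipeline(task: str, writer: Stage, reviewer: Stage,
                    validator: Stage) -> str:
    draft = writer(task)        # implements the change
    review = reviewer(draft)    # fresh context: sees the artifact, not the
                                # writer's reasoning (avoids the blind spot)
    return validator(review)    # final-validation agent before anything ships
```

The structural guarantee is that the reviewer and validator receive only the previous stage's artifact, so a writer's flawed reasoning cannot leak forward and bias its own review.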
Decision Summary#
| Situation | Action |
|---|---|
| Multi-agent team currently on Opus 4.6 all-Opus pipeline | Don't migrate wholesale. Audit roles. Downgrade planner/router to cheap models before switching solver to 4.7 |
| 4.6 prompts carry implicit loose-interpretation | Re-tune for literal instruction following before trusting 4.7 output |
| Context budget felt tight on 4.6 | Will feel tighter on 4.7. Use summarized handoffs + subagents + lower effort for non-hard steps |
| Production unattended fan-out currently uses `--dangerously-skip-permissions` | Switch to auto mode (now available on Max) + isolated environment |
| No harness-level output constraints today | Add schemas, length caps, validators before scaling to multi-agent |
| No independent-reviewer step | Add one — /ultrareview or Writer/Reviewer with fresh context |
| "Strongest model everywhere" default | Re-check Pareto frontier on 4.7 — tokenizer inflation shifts it |
Two Underlying Principles#
Both failure modes specific to Opus on 4.6 — worst-planner-in-class and overthinking-on-short-answers — share one mechanism: scale-dependent overthinking. 4.7's literal instruction following may dampen it; 4.7's xhigh default and more-per-turn thinking cut the other way. Net direction is empirical.
The safe meta-rule for multi-agent coding teams: don't inherit 4.6 deployment decisions on faith. Measure on your workload, then decide. When to Use Claude Opus 4.6 for Work's five rules are the current best defaults — re-validate them as your multi-agent setup matures on 4.7.
Sources#
- Claude Opus 4.7 — 4.7 capability and token-economics deltas
- Claude Code Best Practices — context-budget constraint, subagents, Writer/Reviewer, session management, scaling patterns
- Claude Code Auto Mode — classifier-gated permission middle-ground
- Client-Side Agent Optimization — combo selection, Opus-as-planner failure mode, Pareto frontier
- Scale-Dependent Prompt Sensitivity — brevity constraints, overthinking mechanism, BoolQ exception
- Agent Harness Engineering — enforce invariants mechanically, progressive disclosure, doc gardening
- LLM-Driven Vulnerability Research — final-validation-agent pattern as template for multi-agent review
- When to Use Claude Opus 4.6 for Work — per-rule 4.7 predictions in addendum
