
Opus 4.6 → 4.7 Changes and Multi-Agent Coding Considerations

Published: April 17, 2026 · Filed: Essay · Reading: 9 min · Source: AI-synthesised

4.6→4.7 delta table + six hazards for multi-agent coding teams: role-based model selection, prompt re-tuning, harness invariants, per-agent context budget, unattended-fan-out safety, independent reviewer



Question#

What's the difference between Opus 4.6 & 4.7, and what should I pay attention to when working in a multi-team agent setup for coding?

Part 1 — What Changed Between 4.6 and 4.7#

Same price ($5/M input, $25/M output), same product position, new API ID (claude-opus-4-7). A direct upgrade. Details and nuances from Claude Opus 4.7:

| Axis | 4.7 vs. 4.6 | Why it matters for a multi-agent setup |
| --- | --- | --- |
| Hardest coding tasks | Marketed for "hand off your hardest work"; SOTA on Finance Agent and GDPval-AA | Raises the per-role ceiling where Opus is the right choice |
| Instruction following | Literal; skips/loosens less | Prompts and CLAUDE.md hedges written for 4.6 may misbehave — the biggest migration hazard |
| Vision | Up to 2,576 px long edge (~3.75 MP, >3× prior) | Computer-use agents reading dense screenshots improve |
| File-system memory | Stronger across multi-session work | Favors repo-local versioned artifacts as shared memory between agents |
| Safety | Better on prompt injection and honesty; slightly weaker on harm-reduction over-elaboration | Prompt-injection resistance specifically matters when one agent consumes another's output |
| Cyber capability | Differentially reduced during training, plus a request-level classifier (post-Glasswing) | Legitimate security automation now routes through the Cyber Verification Program |
| Effort levels | New xhigh between high and max; Claude Code default raised to xhigh on all plans | Per-agent token cost at the default increases |
| Tokenizer | Same input → 1.0–1.35× more tokens | Context budget shrinks per agent unless prompts are re-measured |
| Per-turn thinking | Thinks more at higher effort, especially on later agentic turns | Compounds tokenizer inflation across multi-turn orchestration |
| Accompanying launches | Task budgets (API public beta), /ultrareview, auto mode extended to Max | A server-side budget lever arrives alongside auto mode for unattended runs |

Compounded token-budget hit: tokenizer inflation × xhigh default × more per-turn thinking. The naive "drop-in upgrade" consumes more context than 4.6 on the same prompts. Anthropic explicitly recommends measuring on real traffic rather than trusting the net-favorable claim on their internal coding eval.
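The compounding is plain multiplication, which makes it easy to sanity-check before migrating. A minimal sketch, where the 1.35× tokenizer figure comes from the table above but the 20% per-turn-thinking multiplier is an assumed placeholder you would replace with measurements from your own traffic:

```python
def effective_turn_cost(prompt_tokens: int, tokenizer_inflation: float,
                        thinking_multiplier: float) -> float:
    """Tokens a single agent turn consumes after both factors apply."""
    return prompt_tokens * tokenizer_inflation * thinking_multiplier

# 4.6-era assumption: no inflation, baseline thinking
baseline = effective_turn_cost(10_000, 1.0, 1.0)

# Worst-case 4.7: 1.35x tokenizer (from the table) x 20% more thinking (assumed)
worst_case = effective_turn_cost(10_000, 1.35, 1.2)

print(f"baseline: {baseline:.0f} tokens, worst case: {worst_case:.0f} tokens")
```

Run against your real prompt sizes, this is the quickest way to see whether the "drop-in upgrade" actually fits your per-agent context budget.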

Part 2 — What to Pay Attention to in a Multi-Agent Coding Setup#

The research in this wiki points at six concrete hazards when moving from a single-agent workflow to a multi-agent team on Opus 4.7.

1. Role-Based Model Selection, Not Strongest-Model Default#

From Client-Side Agent Optimization (AgentOpt, Hua et al., 2026) measured on Opus 4.6:

| Combination (planner + solver on HotpotQA) | Accuracy |
| --- | --- |
| Ministral 3 8B + Opus | 74.27% |
| Opus + Opus | 31.71% |

Opus-as-planner bypasses the downstream solver's search tools and answers from parametric knowledge. The combo is the right unit of analysis, not per-role accuracy in isolation.

Rule:

  • Put a smaller, more obedient model in planner / router / file-ranker / decomposer seats.
  • Reserve Opus 4.7 for synthesis, integrative reasoning, final-answer generation, and code-review-over-multiple-files roles.
  • Before committing Opus to a role, check whether a cheaper model matches accuracy. On BFCL, Qwen3 Next 80B matched Opus 4.6 at 32× lower cost. 4.7's tokenizer inflation makes this gap worse, not better.

Open question on 4.7: literal instruction following may narrow the planner gap. Don't assume it — re-measure. See When to Use Claude Opus 4.6 for Work addendum for the per-rule 4.7 predictions.
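One way to make role-based selection mechanical rather than aspirational is an explicit role-to-model table that fails loudly on unknown roles. A minimal sketch: `ROLE_MODELS` and `select_model` are hypothetical names, not any real SDK API, and the model IDs simply echo the combinations discussed above.

```python
# Explicit role -> model assignments; no silent "strongest model" fallback.
ROLE_MODELS = {
    "planner": "ministral-3-8b",      # small, obedient model in the planner seat
    "router": "ministral-3-8b",
    "file_ranker": "ministral-3-8b",
    "solver": "claude-opus-4-7",      # reserve Opus for synthesis / final answers
    "reviewer": "claude-opus-4-7",
}

def select_model(role: str) -> str:
    """Fail loudly on unknown roles rather than defaulting to the strongest model."""
    try:
        return ROLE_MODELS[role]
    except KeyError:
        raise ValueError(f"no model assigned for role {role!r}; add it to ROLE_MODELS")

print(select_model("planner"))
```

The deliberate design choice is the `ValueError`: an unmapped role is a configuration bug to surface, not an excuse to fall back to Opus everywhere.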

2. Re-Tune Prompts From the 4.6 Era#

Opus 4.7's literal instruction following is the single biggest migration hazard for existing multi-agent orchestration code. Prompts that worked on 4.6 because it loosely interpreted "or similar," "prefer X," "try to," or skipped apparently-optional steps may now behave rigidly.

Audit list:

  • CLAUDE.md files and system prompts containing hedges
  • Multi-agent role cards that relied on the model "knowing when to delegate"
  • Chained prompts where a later step assumed an earlier step would be skipped
  • Prompts that assumed Opus would correct ambiguous instructions rather than execute them as written
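The audit above can be partly mechanised with a crude scanner for loose-interpretation phrasing. A minimal sketch, assuming a starter phrase list you would extend for your own prompt corpus:

```python
# Flag 4.6-era hedge phrases that a literal instruction follower may
# now execute rigidly. The phrase list is illustrative, not exhaustive.
HEDGES = ["or similar", "prefer ", "try to", "if possible", "when appropriate"]

def find_hedges(prompt: str) -> list[tuple[str, int]]:
    """Return (phrase, line_number) pairs for every hedge found."""
    hits = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        for phrase in HEDGES:
            if phrase in line.lower():
                hits.append((phrase, lineno))
    return hits

prompt = "Use ruff or similar.\nTry to keep functions under 40 lines."
for phrase, lineno in find_hedges(prompt):
    print(f"line {lineno}: contains {phrase!r}")
```

Every hit is a place to decide explicitly: either make the instruction literal and unambiguous, or delete it.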

3. Harness-Level Invariants, Not Prompt Advisories#

From Agent Harness Engineering and Scale-Dependent Prompt Sensitivity:

  • Enforce output constraints mechanically — structured-output schemas, length caps, response validators, /ultrareview-style reviewer passes.
  • Brevity constraints recovered +26.3pp on overthinking-prone problems on 4.6. 4.7's literal instruction following may make these more effective (the model obeys word caps). Use them.
  • Diagnostic exception: BoolQ and similar cross-sentence integration tasks — brevity hurts. Don't cap reasoning-product outputs.

Large-model verbosity stacks in a multi-agent pipeline: each agent's output becomes another agent's context. Errors compound, context fills, reasoning degrades.
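A mechanical constraint in this spirit is a validator that rejects an agent's output before it enters the next agent's context. A minimal sketch, assuming a hypothetical two-field JSON shape and an arbitrary 120-word cap; a real harness would use your own schema and tuned limits:

```python
import json

def validate_agent_output(raw: str, max_words: int = 120) -> dict:
    """Reject outputs that break the schema or the brevity constraint."""
    payload = json.loads(raw)                    # must be valid JSON at all
    for key in ("summary", "files_changed"):     # required fields (assumed schema)
        if key not in payload:
            raise ValueError(f"missing required field {key!r}")
    if len(payload["summary"].split()) > max_words:
        raise ValueError("summary exceeds word cap")
    return payload

ok = validate_agent_output(
    '{"summary": "renamed the config loader", "files_changed": ["config.py"]}'
)
print(ok["files_changed"])
```

Because the cap is enforced in code, verbosity cannot stack silently through the pipeline; a failed validation is a retry signal, not context pollution. Per the BoolQ caveat above, skip the word cap on reasoning-product outputs.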

4. Context Budget Management Is Per-Agent#

Claude Code Best Practices applies to each agent in isolation. With 4.7 default xhigh + tokenizer inflation + more per-turn thinking, multi-agent handoffs consume budget faster. Tactics:

  • Summarized handoffs, not full-context passthrough
  • Subagents in isolated context (Claude Code pattern): investigate separately, return a summary
  • Writer/Reviewer pattern with fresh context on the reviewer side — especially relevant now that /ultrareview is a dedicated 4.7 primitive
  • Lower effort for non-hard steps. Anthropic recommends high or xhigh for coding/agentic; max is rarely worth it
  • Task budgets (API public beta): per-phase spend caps as a server-side analogue to AgentOpt's budget lever
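The first two tactics share one shape: pass a bounded summary plus pointers to repo-local artifacts, never the full transcript. A minimal sketch, where `summarize` is a stand-in for a cheap-model call and the 400-character cap is an arbitrary example:

```python
def summarize(transcript: str, max_chars: int = 400) -> str:
    """Placeholder: a real pipeline would call a small model here."""
    return transcript[:max_chars]

def make_handoff(transcript: str, artifact_paths: list[str]) -> dict:
    """Bounded handoff: capped summary + paths the next agent reads from disk."""
    return {
        "summary": summarize(transcript),
        "artifacts": artifact_paths,  # read lazily downstream, not carried in context
    }

handoff = make_handoff("long investigation transcript... " * 100, ["notes/progress.md"])
print(len(handoff["summary"]))  # bounded regardless of transcript length
```

The handoff size is now a constant of the harness rather than a function of how chatty the upstream agent was, which is exactly what 4.7's compounded token inflation punishes.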

5. Unattended Fan-Out Safety#

From Claude Code Auto Mode (now extended to Max users alongside 4.7):

  • Auto mode is strictly safer than --dangerously-skip-permissions for multi-agent fan-out — a classifier pre-flights every tool call and redirects Claude on risky actions.
  • It is not a substitute for isolated environments. Anthropic documents two classifier failure modes: ambiguous intent and missing environment context.
  • In non-interactive mode, auto mode aborts on repeated blocks rather than hanging on an un-answerable prompt — preserves the fan-out use case.

When an agent fans out commands with destructive shape (migrations, deletes, deploys) across a team: layer auto mode inside a sandboxed container or worktree. Defense in depth.
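One cheap extra layer in that stack is a harness-side shape filter for obviously destructive commands, run before the command ever reaches auto mode's classifier. A minimal sketch with an assumed, deliberately incomplete pattern list; it complements the sandbox and classifier rather than replacing either:

```python
import re

# Command shapes we refuse to fan out unattended (illustrative, not exhaustive).
DESTRUCTIVE = [
    r"\brm\s+-rf\b",
    r"\bdrop\s+table\b",
    r"\bgit\s+push\s+--force\b",
    r"\bterraform\s+destroy\b",
]

def looks_destructive(command: str) -> bool:
    """True if the command matches any known-destructive shape."""
    return any(re.search(pat, command, re.IGNORECASE) for pat in DESTRUCTIVE)

assert looks_destructive("rm -rf build/")
assert not looks_destructive("git status")
```

A regex filter has both classifier failure modes in miniature (it cannot see intent or environment), which is the point: each layer catches some of what the others miss.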

6. Agent-Review Pattern Beats Self-Verification#

From Claude Code Best Practices: the Writer/Reviewer pattern — one agent implements, a second reviews with fresh context — reduces the "own-code blind spot." With Opus 4.7:

  • /ultrareview is the vendored form of this pattern; Pro and Max users get 3 free ultrareviews for evaluation
  • File system memory improvements mean the reviewer can read the writer's progress log and git history efficiently, not just the diff
  • For unattended multi-team workflows: route reviewer-agent output back through a validator agent (mirrors the final-validation-agent pattern in LLM-Driven Vulnerability Research) rather than trusting single-agent reviews
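The control flow above is small enough to sketch end to end. `run_agent` below is a placeholder for an actual model call; the structural point is that each stage gets a fresh, role-specific task rather than inheriting the writer's full transcript:

```python
def run_agent(role: str, task: str) -> str:
    """Stand-in for a model call; returns a canned response per role."""
    return f"[{role}] handled: {task}"

def writer_reviewer_validator(task: str) -> str:
    """Writer implements, reviewer checks with fresh context, validator checks the review."""
    draft = run_agent("writer", task)
    review = run_agent("reviewer", f"review this diff with fresh context: {draft}")
    verdict = run_agent("validator", f"check the review itself: {review}")
    return verdict

print(writer_reviewer_validator("fix flaky auth test"))
```

The final validator stage is what separates this from plain Writer/Reviewer: the review itself is treated as untrusted agent output.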

Decision Summary#

| Situation | Action |
| --- | --- |
| Multi-agent team currently on an all-Opus 4.6 pipeline | Don't migrate wholesale. Audit roles; downgrade planner/router to cheap models before switching the solver to 4.7 |
| 4.6 prompts carry implicit loose interpretation | Re-tune for literal instruction following before trusting 4.7 output |
| Context budget felt tight on 4.6 | It will feel tighter on 4.7. Use summarized handoffs, subagents, and lower effort for non-hard steps |
| Production unattended fan-out currently uses --dangerously-skip-permissions | Switch to auto mode (now available on Max) plus an isolated environment |
| No harness-level output constraints today | Add schemas, length caps, and validators before scaling to multi-agent |
| No independent-reviewer step | Add one: /ultrareview or Writer/Reviewer with fresh context |
| "Strongest model everywhere" default | Re-check the Pareto frontier on 4.7; tokenizer inflation shifts it |

Two Underlying Principles#

Both failure modes specific to Opus on 4.6 — worst-in-class planning and overthinking on short answers — share one mechanism: scale-dependent overthinking. 4.7's literal instruction following may dampen it; its xhigh default and increased per-turn thinking cut the other way. The net direction is an empirical question.

The safe meta-rule for multi-agent coding teams: don't inherit 4.6 deployment decisions on faith. Measure on your workload, then decide. The five rules in When to Use Claude Opus 4.6 for Work remain the current best defaults — re-validate them as your multi-agent setup matures on 4.7.

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
