
Opus 4.6 → 4.7 Changes and Multi-Agent Coding Considerations

Published: April 17, 2026 · Filed: Essay · Reading: 9 min · Source: AI-synthesised

4.6→4.7 delta table + six hazards for multi-agent coding teams: role-based model selection, prompt re-tuning, harness invariants, per-agent context budget, unattended-fan-out safety, independent reviewer



Question#

What's the difference between Opus 4.6 & 4.7, and what should I pay attention to when working in a multi-team agent setup for coding?

Part 1 — What Changed Between 4.6 and 4.7#

Same price ($5/M input, $25/M output), same product position, new API ID (claude-opus-4-7). A direct upgrade. Details and nuances from Claude Opus 4.7:

| Axis | 4.7 vs. 4.6 | Why it matters for a multi-agent setup |
| --- | --- | --- |
| Hardest coding tasks | Marketed for "hand off your hardest work"; SOTA on Finance Agent and GDPval-AA | Raises the per-role ceiling where Opus is the right choice |
| Instruction following | Literal; skips/loosens less | Prompts and CLAUDE.md hedges written for 4.6 may misbehave — the biggest migration hazard |
| Vision | Up to 2,576 px long edge (~3.75 MP, >3× prior) | Computer-use agents reading dense screenshots improve |
| File-system memory | Stronger across multi-session work | Favors repo-local versioned artifacts as shared memory between agents |
| Safety | Better on prompt injection and honesty; slightly weaker on harm-reduction over-elaboration | Prompt-injection resistance specifically matters when one agent consumes another's output |
| Cyber capability | Differentially reduced during training, plus a request-level classifier (post-Glasswing) | Legitimate security automation now routes through the Cyber Verification Program |
| Effort levels | New xhigh between high and max; Claude Code default raised to xhigh on all plans | Per-agent token cost at the default increases |
| Tokenizer | Same input → 1.0–1.35× more tokens | Context budget shrinks per agent unless prompts are re-measured |
| Per-turn thinking | Thinks more at higher effort, especially on later agentic turns | Compounds tokenizer inflation across multi-turn orchestration |
| Accompanying launches | Task budgets (API public beta), /ultrareview, auto mode extended to Max | A server-side budget lever arrives alongside auto mode for unattended runs |

Compounded token-budget hit: tokenizer inflation × xhigh default × more per-turn thinking. The naive "drop-in upgrade" consumes more context than 4.6 on the same prompts. Anthropic explicitly recommends measuring on real traffic rather than trusting the net-favorable claim on their internal coding eval.
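The compounding is plain multiplication, which makes it easy to sanity-check before migrating. A minimal sketch, where the 1.35× tokenizer figure comes from the table above but the 20% per-turn-thinking multiplier is an assumed placeholder you would replace with measurements from your own traffic:

```python
def effective_turn_cost(prompt_tokens: int, tokenizer_inflation: float,
                        thinking_multiplier: float) -> float:
    """Tokens a single agent turn consumes after both factors apply."""
    return prompt_tokens * tokenizer_inflation * thinking_multiplier

# 4.6-era assumption: no inflation, baseline thinking
baseline = effective_turn_cost(10_000, 1.0, 1.0)

# Worst-case 4.7: 1.35x tokenizer (from the table) x 20% more thinking (assumed)
worst_case = effective_turn_cost(10_000, 1.35, 1.2)

print(f"baseline: {baseline:.0f} tokens, worst case: {worst_case:.0f} tokens")
```

Run against your real prompt sizes, this is the quickest way to see whether the "drop-in upgrade" actually fits your per-agent context budget.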

Part 2 — What to Pay Attention to in a Multi-Agent Coding Setup#

The research in this wiki points at six concrete hazards when moving from a single-agent workflow to a multi-agent team on Opus 4.7.

1. Role-Based Model Selection, Not Strongest-Model Default#

From Client-Side Agent Optimization (AgentOpt, Hua et al., 2026) measured on Opus 4.6:

| Combination (planner + solver on HotpotQA) | Accuracy |
| --- | --- |
| Ministral 3 8B + Opus | 74.27% |
| Opus + Opus | 31.71% |

Opus-as-planner bypasses the downstream solver's search tools and answers from parametric knowledge. The combo is the right unit of analysis, not per-role accuracy in isolation.

Rule:

  • Put a smaller, more obedient model in planner / router / file-ranker / decomposer seats.
  • Reserve Opus 4.7 for synthesis, integrative reasoning, final-answer generation, and code-review-over-multiple-files roles.
  • Before committing Opus to a role, check whether a cheaper model matches accuracy. On BFCL, Qwen3 Next 80B matched Opus 4.6 at 32× lower cost. 4.7's tokenizer inflation makes this gap worse, not better.

Open question on 4.7: literal instruction following may narrow the planner gap. Don't assume it — re-measure. See When to Use Claude Opus 4.6 for Work addendum for the per-rule 4.7 predictions.
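One way to make role-based selection mechanical rather than aspirational is an explicit role-to-model table that fails loudly on unknown roles. A minimal sketch: `ROLE_MODELS` and `select_model` are hypothetical names, not any real SDK API, and the model IDs simply echo the combinations discussed above.

```python
# Explicit role -> model assignments; no silent "strongest model" fallback.
ROLE_MODELS = {
    "planner": "ministral-3-8b",      # small, obedient model in the planner seat
    "router": "ministral-3-8b",
    "file_ranker": "ministral-3-8b",
    "solver": "claude-opus-4-7",      # reserve Opus for synthesis / final answers
    "reviewer": "claude-opus-4-7",
}

def select_model(role: str) -> str:
    """Fail loudly on unknown roles rather than defaulting to the strongest model."""
    try:
        return ROLE_MODELS[role]
    except KeyError:
        raise ValueError(f"no model assigned for role {role!r}; add it to ROLE_MODELS")

print(select_model("planner"))
```

The deliberate design choice is the `ValueError`: an unmapped role is a configuration bug to surface, not an excuse to fall back to Opus everywhere.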

2. Re-Tune Prompts From the 4.6 Era#

Opus 4.7's literal instruction following is the single biggest migration hazard for existing multi-agent orchestration code. Prompts that worked on 4.6 because it loosely interpreted "or similar," "prefer X," "try to," or skipped apparently-optional steps may now behave rigidly.

Audit list:

  • CLAUDE.md files and system prompts containing hedges
  • Multi-agent role cards that relied on the model "knowing when to delegate"
  • Chained prompts where a later step assumed an earlier step would be skipped
  • Prompts that assumed Opus would correct ambiguous instructions rather than execute them as written
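The audit above can be partly mechanised with a crude scanner for loose-interpretation phrasing. A minimal sketch, assuming a starter phrase list you would extend for your own prompt corpus:

```python
# Flag 4.6-era hedge phrases that a literal instruction follower may
# now execute rigidly. The phrase list is illustrative, not exhaustive.
HEDGES = ["or similar", "prefer ", "try to", "if possible", "when appropriate"]

def find_hedges(prompt: str) -> list[tuple[str, int]]:
    """Return (phrase, line_number) pairs for every hedge found."""
    hits = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        for phrase in HEDGES:
            if phrase in line.lower():
                hits.append((phrase, lineno))
    return hits

prompt = "Use ruff or similar.\nTry to keep functions under 40 lines."
for phrase, lineno in find_hedges(prompt):
    print(f"line {lineno}: contains {phrase!r}")
```

Every hit is a place to decide explicitly: either make the instruction literal and unambiguous, or delete it.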

3. Harness-Level Invariants, Not Prompt Advisories#

From Agent Harness Engineering and Scale-Dependent Prompt Sensitivity:

  • Enforce output constraints mechanically — structured-output schemas, length caps, response validators, /ultrareview-style reviewer passes.
  • Brevity constraints recovered +26.3pp on overthinking-prone problems on 4.6. 4.7's literal instruction following may make these more effective (the model obeys word caps). Use them.
  • Diagnostic exception: BoolQ and similar cross-sentence integration tasks — brevity hurts. Don't cap reasoning-product outputs.

Large-model verbosity stacks in a multi-agent pipeline: each agent's output becomes another agent's context. Errors compound, context fills, reasoning degrades.
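A mechanical constraint in this spirit is a validator that rejects an agent's output before it enters the next agent's context. A minimal sketch, assuming a hypothetical two-field JSON shape and an arbitrary 120-word cap; a real harness would use your own schema and tuned limits:

```python
import json

def validate_agent_output(raw: str, max_words: int = 120) -> dict:
    """Reject outputs that break the schema or the brevity constraint."""
    payload = json.loads(raw)                    # must be valid JSON at all
    for key in ("summary", "files_changed"):     # required fields (assumed schema)
        if key not in payload:
            raise ValueError(f"missing required field {key!r}")
    if len(payload["summary"].split()) > max_words:
        raise ValueError("summary exceeds word cap")
    return payload

ok = validate_agent_output(
    '{"summary": "renamed the config loader", "files_changed": ["config.py"]}'
)
print(ok["files_changed"])
```

Because the cap is enforced in code, verbosity cannot stack silently through the pipeline; a failed validation is a retry signal, not context pollution. Per the BoolQ caveat above, skip the word cap on reasoning-product outputs.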

4. Context Budget Management Is Per-Agent#

Claude Code Best Practices applies to each agent in isolation. With 4.7 default xhigh + tokenizer inflation + more per-turn thinking, multi-agent handoffs consume budget faster. Tactics:

  • Summarized handoffs, not full-context passthrough
  • Subagents in isolated context (Claude Code pattern): investigate separately, return a summary
  • Writer/Reviewer pattern with fresh context on the reviewer side — especially relevant now that /ultrareview is a dedicated 4.7 primitive
  • Lower effort for non-hard steps. Anthropic recommends high or xhigh for coding/agentic; max is rarely worth it
  • Task budgets (API public beta): per-phase spend caps as a server-side analogue to AgentOpt's budget lever
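The first two tactics share one shape: pass a bounded summary plus pointers to repo-local artifacts, never the full transcript. A minimal sketch, where `summarize` is a stand-in for a cheap-model call and the 400-character cap is an arbitrary example:

```python
def summarize(transcript: str, max_chars: int = 400) -> str:
    """Placeholder: a real pipeline would call a small model here."""
    return transcript[:max_chars]

def make_handoff(transcript: str, artifact_paths: list[str]) -> dict:
    """Bounded handoff: capped summary + paths the next agent reads from disk."""
    return {
        "summary": summarize(transcript),
        "artifacts": artifact_paths,  # read lazily downstream, not carried in context
    }

handoff = make_handoff("long investigation transcript... " * 100, ["notes/progress.md"])
print(len(handoff["summary"]))  # bounded regardless of transcript length
```

The handoff size is now a constant of the harness rather than a function of how chatty the upstream agent was, which is exactly what 4.7's compounded token inflation punishes.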

5. Unattended Fan-Out Safety#

From Claude Code Auto Mode (now extended to Max users alongside 4.7):

  • Auto mode is strictly safer than --dangerously-skip-permissions for multi-agent fan-out — a classifier pre-flights every tool call and redirects Claude on risky actions.
  • It is not a substitute for isolated environments. Anthropic documents two classifier failure modes: ambiguous intent and missing environment context.
  • In non-interactive mode, auto mode aborts on repeated blocks rather than hanging on an un-answerable prompt — preserves the fan-out use case.

When an agent fans out commands with destructive shape (migrations, deletes, deploys) across a team: layer auto mode inside a sandboxed container or worktree. Defense in depth.
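One cheap extra layer in that stack is a harness-side shape filter for obviously destructive commands, run before the command ever reaches auto mode's classifier. A minimal sketch with an assumed, deliberately incomplete pattern list; it complements the sandbox and classifier rather than replacing either:

```python
import re

# Command shapes we refuse to fan out unattended (illustrative, not exhaustive).
DESTRUCTIVE = [
    r"\brm\s+-rf\b",
    r"\bdrop\s+table\b",
    r"\bgit\s+push\s+--force\b",
    r"\bterraform\s+destroy\b",
]

def looks_destructive(command: str) -> bool:
    """True if the command matches any known-destructive shape."""
    return any(re.search(pat, command, re.IGNORECASE) for pat in DESTRUCTIVE)

assert looks_destructive("rm -rf build/")
assert not looks_destructive("git status")
```

A regex filter has both classifier failure modes in miniature (it cannot see intent or environment), which is the point: each layer catches some of what the others miss.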

6. Agent-Review Pattern Beats Self-Verification#

From Claude Code Best Practices: the Writer/Reviewer pattern — one agent implements, a second reviews with fresh context — reduces the "own-code blind spot." With Opus 4.7:

  • /ultrareview is the vendored form of this pattern; Pro and Max users get 3 free ultrareviews for evaluation
  • File system memory improvements mean the reviewer can read the writer's progress log and git history efficiently, not just the diff
  • For unattended multi-team workflows: route reviewer-agent output back through a validator agent (mirrors the final-validation-agent pattern in LLM-Driven Vulnerability Research) rather than trusting single-agent reviews
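The control flow above is small enough to sketch end to end. `run_agent` below is a placeholder for an actual model call; the structural point is that each stage gets a fresh, role-specific task rather than inheriting the writer's full transcript:

```python
def run_agent(role: str, task: str) -> str:
    """Stand-in for a model call; returns a canned response per role."""
    return f"[{role}] handled: {task}"

def writer_reviewer_validator(task: str) -> str:
    """Writer implements, reviewer checks with fresh context, validator checks the review."""
    draft = run_agent("writer", task)
    review = run_agent("reviewer", f"review this diff with fresh context: {draft}")
    verdict = run_agent("validator", f"check the review itself: {review}")
    return verdict

print(writer_reviewer_validator("fix flaky auth test"))
```

The final validator stage is what separates this from plain Writer/Reviewer: the review itself is treated as untrusted agent output.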

Decision Summary#

| Situation | Action |
| --- | --- |
| Multi-agent team currently on an all-Opus 4.6 pipeline | Don't migrate wholesale. Audit roles; downgrade planner/router to cheap models before switching the solver to 4.7 |
| 4.6 prompts carry implicit loose interpretation | Re-tune for literal instruction following before trusting 4.7 output |
| Context budget felt tight on 4.6 | It will feel tighter on 4.7. Use summarized handoffs, subagents, and lower effort for non-hard steps |
| Production unattended fan-out currently uses --dangerously-skip-permissions | Switch to auto mode (now available on Max) plus an isolated environment |
| No harness-level output constraints today | Add schemas, length caps, and validators before scaling to multi-agent |
| No independent-reviewer step | Add one: /ultrareview or Writer/Reviewer with fresh context |
| "Strongest model everywhere" default | Re-check the Pareto frontier on 4.7; tokenizer inflation shifts it |

Two Underlying Principles#

Both failure modes specific to Opus on 4.6 — worst-in-class planning and overthinking on short answers — share one mechanism: scale-dependent overthinking. 4.7's literal instruction following may dampen it; its xhigh default and increased per-turn thinking cut the other way. The net direction is an empirical question.

The safe meta-rule for multi-agent coding teams: don't inherit 4.6 deployment decisions on faith. Measure on your workload, then decide. The five rules in When to Use Claude Opus 4.6 for Work remain the current best defaults — re-validate them as your multi-agent setup matures on 4.7.

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
