Planning / Execution Division of Labor

Sources#

Agentic coding and persistent returns to expertise

Summary#

Anthropic's 400K-session study supplies the empirical shape of human–agent collaboration in agentic coding: people decide what to build; the agent decides how. Measured by a privacy-preserving decision-attribution classifier, in a typical Claude Code session the user makes about 70% of the planning decisions (what to do, which approach, what counts as done) but only about 20% of the execution decisions (which files to change, what code to write, which commands to run). This is the clean, quantified version of the role-inversion the rest of the corpus describes qualitatively — coding stops being the human's job, the human becomes an allocator/director, thinking is delegated, understanding is retained.

Evidence note. empirical, with the same first-party caveat as Returns to Expertise in Agentic Coding: Anthropic measuring its own product via Clio + Sonnet-4.6 classifiers, validated against telemetry, excluding headless/SDK/IDE usage. Decision attribution is transcript-inferred.

Two lenses: decisions and actions#

The study separates who decides from how much gets delegated:

Decisions (content). The classifier lists every meaningful decision, splits it into planning vs execution, and attributes each to the user or Claude. Result: ~70% of planning is human, ~80% of execution is Claude's. A clean division of labor, not a blur.
Actions (structure). A session is a back-and-forth: the user prompts, Claude goes off and acts. A typical session is ~4 turns; each user prompt sets off a chain of ~10 Claude actions on average (reading files, editing code, running commands), writing ~2,400 words per turn. The tail is long — ~2% of sessions average >100 actions per prompt.

The two lenses lock together: how much Claude does between check-ins tracks who controls planning. When the user keeps execution control (>80% of execution decisions), Claude takes fewer actions per turn (~8). When Claude controls planning (>80% of planning decisions), it runs the longest chains (~16 actions). Delegating the plan is what lengthens the leash — and per Returns to Expertise in Agentic Coding, domain expertise is what lets a user safely hand over a longer one (novice ~5 → expert ~12 actions/prompt).

The tension with "AI as primary author"#

This is the most interesting cross-source juxtaposition in the wiki, because the two numbers look contradictory until you separate the units:

Faros: AI authors ~60% of accepted code, and the assistant→author threshold was crossed "without a deliberate decision."
This study: humans still make ~70% of planning decisions and ~80% of execution is Claude's.

They are not in conflict — they measure different things. Faros counts lines of code authored (an execution-layer metric); Anthropic counts decisions attributed (separating planning from execution). Reconciled: Claude writes most of the lines (execution) while humans still own most of the planning decisions. "AI is the author" and "humans decide what to build" are simultaneously true. The genuine open worry survives the reconciliation, though: Faros's "without a deliberate decision" and this study's 80%-execution-to-Claude both describe a quiet drift, and the rubber-stamping risk is whether nominal human planning control hollows out into approval-by-default.

Capability ceiling vs. realized autonomy#

The report is careful to distinguish what models can do from what users let them do. METR's time-horizon evaluations measure the ceiling — frontier models can now complete tasks that would take a person many hours, working through obstacles autonomously. The decision-attribution and actions-per-prompt measures here capture the realized division in actual sessions: even with a high and rising ceiling, the typical user keeps planning control and grants execution. The gap between ceiling and realized autonomy is itself a variable to watch — if planning increasingly shifts to Claude as the ceiling rises, that is the harness shrinking on the human-decision axis.

Connections#

Implementation Abundance Inverts Product Work — the process the division reorganizes: humans curate/decide, agents execute the abundant builds
Role Averaging, Not Role Elimination — the team-structure reorganization built on top of "humans decide what, agents decide how"
AI as Primary Author — the apparent contradiction (60% authorship vs 70% human planning) resolved by separating line-authorship from decision-attribution
Compute Allocator — "humans make the planning decisions" is exactly Thariq's allocator role: deciding what's worth doing while the model produces
Verification as the New Bottleneck — if the human owns planning + verification and Claude owns execution, the human's judgment throughput is the binding constraint
Returns to Expertise in Agentic Coding — expertise is what lets a user safely delegate planning and unlock the longer (16-action) chains
Task Time-Horizon Scaling — the capability ceiling (what models can do alone) vs. this study's realized autonomy (what users actually delegate)
Harness Shrinkage as Models Improve — the share of planning delegated to the agent is a usage-side reading of the shrinking harness
Outsource Your Thinking, Not Your Understanding — "decide what / agent decides how" is thinking outsourced, understanding (the planning) retained
Claude Code — the surface the division is measured on
Conversation-to-Delegation Shift — the cross-population realized-autonomy data: how much work users actually delegate to Codex (16.5% / 63.3% / 99.8% token share) is this division measured at adoption scale
Parallel Agent Orchestration — the human keeping the planning/coordination role while execution fans out across many concurrent agents — the division of labor at fleet scale

Open questions#

Does the human share of planning decisions fall over time as models improve (the ceiling rising into the planning layer), or is ~70% a stable human floor?
"Decision attribution" is inferred from transcripts. When Claude proposes a plan and the user assents, is that scored as the user's planning decision or Claude's? The rubber-stamping boundary is exactly where the measure is hardest.
Headless/SDK/pipeline usage (excluded here) is where execution autonomy is highest and planning is front-loaded into a single prompt — does the 70/20 split survive there, or collapse toward full delegation?

Sources#

Agentic coding and persistent returns to expertise — §"The division of labor", §"Who decides what"