Howardism

Planning / Execution Division of Labor

Published: June 17, 2026
Filed: Concept
Domain: AI Engineering
Tags: AI Coding Workflow, Agent Engineering, Human AI Collaboration, Empirical, Anthropic
Reading: 6 min
Source: AI-synthesised

Anthropic's 400K-session telemetry: in a typical Claude Code session, humans make ~70% of planning decisions (what to do) while Claude makes ~80% of execution decisions (how to do it). Each prompt sets off ~10 actions on average (~8 when the user keeps execution control, ~16 when Claude controls planning). In short: 'people decide what to build, the agent decides how.'


Sources

Summary

Anthropic's 400K-session study supplies the empirical shape of human–agent collaboration in agentic coding: people decide what to build; the agent decides how. As measured by a privacy-preserving decision-attribution classifier, the user in a typical Claude Code session makes about 70% of the planning decisions (what to do, which approach, what counts as done) but only about 20% of the execution decisions (which files to change, what code to write, which commands to run). This is the clean, quantified version of the role inversion the rest of the corpus describes qualitatively: coding stops being the human's job, the human becomes an allocator and director, thinking is delegated, and understanding is retained.

Evidence note. Empirical, with the same first-party caveat as Returns to Expertise in Agentic Coding: Anthropic is measuring its own product with Clio and Sonnet-4.6 classifiers, validated against telemetry, and excluding headless, SDK, and IDE usage. Decision attribution is inferred from transcripts.

Two lenses: decisions and actions

The study separates who decides from how much gets delegated:

  • Decisions (content). The classifier lists every meaningful decision, labels it as planning or execution, and attributes it to the user or to Claude. Result: ~70% of planning decisions are the user's and ~80% of execution decisions are Claude's. A clean division of labor, not a blur.
  • Actions (structure). A session is a back-and-forth: the user prompts, then Claude goes off and acts. A typical session runs ~4 turns. Each user prompt sets off a chain of ~10 Claude actions on average (reading files, editing code, running commands), and Claude writes ~2,400 words per turn. The tail is long: ~2% of sessions average >100 actions per prompt.
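
A minimal sketch of the decision lens, assuming each classified decision has already been reduced to a (layer, actor) pair. The `Decision` record and the `attribution_shares` helper are hypothetical names used only for illustration, and the sample data is invented to mirror the reported shares; it is not taken from the study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    layer: str  # "planning" or "execution"
    actor: str  # "user" or "claude"


def attribution_shares(decisions):
    """Return {layer: {actor: share}}; shares within a layer sum to 1."""
    counts = Counter((d.layer, d.actor) for d in decisions)
    totals = Counter(d.layer for d in decisions)
    shares = {}
    for (layer, actor), n in counts.items():
        shares.setdefault(layer, {})[actor] = n / totals[layer]
    return shares


# Invented sample: 10 planning decisions (7 by the user) and
# 10 execution decisions (8 by Claude).
sample = (
    [Decision("planning", "user")] * 7
    + [Decision("planning", "claude")] * 3
    + [Decision("execution", "user")] * 2
    + [Decision("execution", "claude")] * 8
)

shares = attribution_shares(sample)
print(shares["planning"]["user"])     # 0.7
print(shares["execution"]["claude"])  # 0.8
```

Shares are computed per layer rather than over all decisions, which is what allows a 70% human planning share and an 80% Claude execution share to hold in the same session.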

The two lenses lock together: how much Claude does between check-ins tracks who controls planning. When the user keeps execution control (>80% of execution decisions), Claude takes fewer actions per turn (~8). When Claude controls planning (>80% of planning decisions), it runs the longest chains (~16 actions). Delegating the plan is what lengthens the leash, and per Returns to Expertise in Agentic Coding, domain expertise is what lets a user safely hand over a longer one (novice ~5 → expert ~12 actions/prompt).
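
The bucketing described above can be sketched as follows, assuming each session has already been reduced to three numbers. The `Session` record, the `control_bucket` helper, and the sample values are hypothetical. The sample is invented so that the bucket means land on the reported ~8 and ~16; it is not study data. The 0.8 threshold mirrors the >80% cutoffs quoted in the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Session:
    user_execution_share: float   # fraction of execution decisions made by the user
    claude_planning_share: float  # fraction of planning decisions made by Claude
    actions_per_prompt: float     # mean Claude actions per user prompt


def control_bucket(session, threshold=0.8):
    """Label a session by who holds more than `threshold` of a decision layer.

    If both conditions hold, the execution check wins; that tie-break is an
    arbitrary choice made for this sketch.
    """
    if session.user_execution_share > threshold:
        return "user_controls_execution"
    if session.claude_planning_share > threshold:
        return "claude_controls_planning"
    return "mixed"


def mean_actions_by_bucket(sessions):
    """Average actions per prompt within each control bucket."""
    buckets = {}
    for s in sessions:
        buckets.setdefault(control_bucket(s), []).append(s.actions_per_prompt)
    return {name: mean(values) for name, values in buckets.items()}


# Invented sessions, chosen to echo the reported pattern.
sessions = [
    Session(0.90, 0.10, 7.0),
    Session(0.85, 0.20, 9.0),
    Session(0.10, 0.90, 15.0),
    Session(0.05, 0.95, 17.0),
    Session(0.20, 0.30, 10.0),
]

result = mean_actions_by_bucket(sessions)
print(result["user_controls_execution"])   # 8.0
print(result["claude_controls_planning"])  # 16.0
```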

The tension with "AI as primary author"

This is the most interesting cross-source juxtaposition in the wiki, because the two numbers look contradictory until you separate the units:

  • Faros: AI authors ~60% of accepted code, and the assistant→author threshold was crossed "without a deliberate decision."
  • This study: humans still make ~70% of planning decisions, while Claude makes ~80% of execution decisions.

They are not in conflict; they measure different things. Faros counts lines of code authored (an execution-layer metric); Anthropic counts decisions attributed (separating planning from execution). Reconciled: Claude writes most of the lines (execution) while humans still own most of the planning decisions. "AI is the author" and "humans decide what to build" are simultaneously true. The genuine open worry survives the reconciliation, though. Faros's "without a deliberate decision" and this study's 80%-execution-to-Claude both describe a quiet drift, and the risk is rubber-stamping: nominal human planning control hollowing out into approval-by-default.

Capability ceiling vs. realized autonomy

The report is careful to distinguish what models can do from what users let them do. METR's time-horizon evaluations measure the ceiling: frontier models can now complete tasks that would take a person many hours, working through obstacles autonomously. The decision-attribution and actions-per-prompt measures here capture the realized division in actual sessions: even with a high and rising ceiling, the typical user keeps planning control and grants execution. The gap between ceiling and realized autonomy is itself a variable to watch. If planning increasingly shifts to Claude as the ceiling rises, that is the harness shrinking on the human-decision axis.

Connections

Open questions

  • Does the human share of planning decisions fall over time as models improve (the ceiling rising into the planning layer), or is ~70% a stable human floor?
  • "Decision attribution" is inferred from transcripts. When Claude proposes a plan and the user assents, is that scored as the user's planning decision or Claude's? The rubber-stamping boundary is exactly where the measure is hardest.
  • Headless/SDK/pipeline usage (excluded here) is where execution autonomy is highest and planning is front-loaded into a single prompt. Does the 70/20 split (the user's share of planning vs. execution decisions) survive there, or collapse toward full delegation?
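
To make the assent question concrete, here is a small sketch of how sensitive the human planning share could be to the scoring rule. The article does not say which rule the classifier applies, so both rules below are assumptions, the `user_planning_share` helper is hypothetical, and the counts are invented for illustration.

```python
def user_planning_share(user_initiated, claude_initiated, assented,
                        assent_counts_as_user):
    """Share of planning decisions credited to the user under one scoring rule.

    `assented` counts plans that Claude proposed and the user approved; the
    flag decides which side those decisions are credited to.
    """
    total = user_initiated + claude_initiated + assented
    if total == 0:
        raise ValueError("no planning decisions to score")
    credited = user_initiated + (assented if assent_counts_as_user else 0)
    return credited / total


# Invented counts: 40 user-initiated plans, 30 Claude-initiated plans,
# and 30 Claude proposals that the user simply approved.
generous = user_planning_share(40, 30, 30, assent_counts_as_user=True)
strict = user_planning_share(40, 30, 30, assent_counts_as_user=False)
print(generous)  # 0.7
print(strict)    # 0.4
```

Under these made-up counts the same transcripts read as 70% or 40% human planning depending on a single scoring choice, which is why the rubber-stamping boundary matters for interpreting the headline figure.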

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
