H
Howardism
Plate IIAI EngineeringHOWARDISM

Agentic Coding Work-Composition Shift

PublishedJune 17, 2026FiledConceptDomainAI EngineeringTagsAI Coding WorkflowAgent EngineeringEngineering MetricsEmpiricalAnthropicReading5 minSourceAI-synthesised

Anthropic's 400K-session telemetry, Oct 2025→Apr 2026: as models improved, the share of sessions fixing broken code fell 33%→19% (debugging nearly halved), while operating software (14%→21%) and writing+data-analysis (~10%→~20%) grew; estimated task value rose ~25–27% — usage moving from firefighting toward end-to-end agentic work

Illustration for Agentic Coding Work-Composition Shift

Sources#

Summary#

The longitudinal finding of Anthropic's 400K-session study: over just seven months (Oct 2025 → Apr 2026), what people do with Claude Code shifted measurably away from firefighting and toward end-to-end agentic work — and the work got more valuable. The clearest single move: the share of sessions spent fixing broken code fell from 33% to 19% (the debugging share nearly halved). In its place grew the work around code — operating software 14%→21%, and writing + data analysis roughly doubling, ~10%→~20%. Meanwhile the estimated economic value of the average session rose ~25–27%. This is harness shrinkage seen from the usage side: as the model got more capable, users spent less time un-breaking things and more time delegating whole tasks.

Evidence note. empirical, same first-party caveat as Returns to Expertise in Agentic Coding (Anthropic measuring its own product via Clio + Sonnet-4.6 classifiers, validated against telemetry; excludes headless/SDK/IDE usage). Task value is a deliberately relative proxy — see below.

The nine work modes#

Every session is classified into the single mode that best describes it. The taxonomy itself is worth recording:

  • Code-direct (≈56%): building new (25%), fixing broken (26%), testing/orchestrating (5%)
  • Operating (17%): deploying, configuring, running pipelines, monitoring
  • Plan/explore (14%): understanding an existing system, planning a change
  • Analysis/prose (13%): analyzing data, communicating via docs/presentations

The classifier agrees with automatic telemetry — >90% of sessions it labeled as creating/modifying code showed actual code changes — which is the validation anchor for trusting the rest of the transcript-based measures.

The shift, quantified (Oct 2025 → Apr 2026)#

Work modeDirectionChange
Fixing broken code33% → 19%
Operating software14% → 21%
Writing + data analysis~10% → ~20%

And value rose across the board (freelance-marketplace proxy): average session +~27%; building +43%, operating +34%, fixing +32%. The report stresses the dollar figures are coarse and best read as relative movement, not literal value.

The interpretation: less time is going to the reactive, low-leverage work (debugging) and more to proactive, higher-leverage, more autonomous work (operate-the-system, analyze-the-data, write-the-document). The report explicitly frames Claude Code usage as a possible preview of where knowledge work is headed as agents embed in non-coding work — the doubling of writing/analysis is the leading edge of that spread beyond code.

Telemetry, not survey — and the contrast it sets up#

Like Faros, this study reads behavior from system signals (here, session transcripts + automatic telemetry via Clio) rather than from how developers feel. But the two telemetry studies reach opposite-feeling conclusions, and the contrast is instructive:

  • This study: session value up ~27%, debugging down, success rates up — an optimistic picture at the level of the individual interactive session.
  • Faros: throughput up but quality down (bugs/dev +54%, incidents/PR +243%, review time 5×) — a pessimistic picture at the level of the organization's SDLC.

These are not a direct contradiction; they measure different units and stages. Anthropic measures did this session accomplish its goal and how valuable was it (a within-session, interactive-usage proxy, with the human in the loop); Faros measures what happens downstream across the whole org's pipeline weeks later (incidents, review queues, reopened tickets). One can be true at the session layer while the other is true at the org layer — "the task got done and felt valuable" and "the cumulative downstream cost is rising" are the felt-productivity-vs-system-outcome split that Telemetry vs. Survey Measurement itself flags. Evidence weight also differs: this is empirical research telemetry; Faros is vendor-claim lead-gen telemetry.

Connections#

  • Harness Shrinkage as Models Improve — the debugging-share collapse and the move to end-to-end delegation is the shrinking harness read from usage data: as the model improves, less scaffolding/firefighting per task
  • Acceleration Whiplash — the juxtaposed telemetry: value/success up at the session layer here vs. quality down at the org layer there; different units, both potentially true
  • Telemetry vs. Survey Measurement — both studies measure behavior not feeling; this is the empirical cousin to Faros's vendor-claim telemetry, and the session-vs-org distinction is the same felt-vs-system split
  • Returns to Expertise in Agentic Coding — the companion finding from the same study: who succeeds, alongside this what shifts
  • AI as Primary Author — more end-to-end agentic use (operate/analyze/write) is the usage-side of authorship moving toward the agent
  • Task Time-Horizon Scaling — the rising capability ceiling is the upstream cause: longer reliable task horizons are what let usage move from fixing to operating/analyzing whole workflows
  • Claude Code — the product whose usage composition this tracks
  • Conversation-to-Delegation Shift — the OpenAI/Codex, cross-population telemetry of the same asking→doing move; this page is its Anthropic/Claude-Code, single-tool sibling — two labs, two tools, one shift

Open questions#

  • The window is seven months and the value proxy is coarse/relative. How much of the +27% is genuine task-complexity growth vs. classifier/marketplace-matching drift?
  • The study excludes headless/SDK/IDE usage — a "substantial share," and likely the most automated/end-to-end. Does including it accelerate or reverse the composition shift?
  • If "fixing" keeps falling, is that because models break less, or because broken-code work is migrating to non-interactive pipelines this study doesn't see?

Sources#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 11
  • Acceleration Whiplash

    Faros 2026: AI floods a human-paced SDLC with output it can't absorb — throughput up (tasks +34%, epics +66%), quality…

  • AI as Primary Author

    Faros 2026: the assistant→author threshold crossed without a deliberate decision, marked by AI-code acceptance rising 2…

  • Anthropic

    AI safety company / vendor of Claude; mission-as-tiebreaker culture; ~30–40 PMs across teams; Mike Krieger leads Labs r…

  • Anthropic Economic Index

    Anthropic's recurring economic-research program measuring how Claude usage maps to and diffuses through the economy — p…

  • Claude Code

    Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…

  • Conversation-to-Delegation Shift

    OpenAI's Codex usage study (June 2026): the move from conversational AI ('asking') to agentic AI ('delegated production…

  • AI Engineering & Agent Tooling

    Map of Content for the ai-engineering domain — 45 concepts. Curated entry point; see Home for all domains.

  • Open Questions Backlog

    _124 pages with open questions, as of 2026-06-19._

  • Returns to Expertise in Agentic Coding

    Anthropic's 400K-session study: domain expertise (not coding skill) is what amplifies an agent — experts get 2× the act…

  • Task Time-Horizon Scaling

    METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…

  • Telemetry vs. Survey Measurement

    Faros 2026: perception lags reality, so survey-based engineering research (DORA) misses downstream AI damage that syste…

Related articles