Agentic Coding Work-Composition Shift

Sources#

Agentic coding and persistent returns to expertise

Summary#

The longitudinal finding of Anthropic's 400K-session study: over just seven months (Oct 2025 → Apr 2026), what people do with Claude Code shifted measurably away from firefighting and toward end-to-end agentic work — and the work got more valuable. The clearest single move: the share of sessions spent fixing broken code fell from 33% to 19% (the debugging share nearly halved). In its place grew the work around code — operating software 14%→21%, and writing + data analysis roughly doubling, ~10%→~20%. Meanwhile the estimated economic value of the average session rose ~25–27%. This is harness shrinkage seen from the usage side: as the model got more capable, users spent less time un-breaking things and more time delegating whole tasks.

Evidence note. empirical, same first-party caveat as Returns to Expertise in Agentic Coding (Anthropic measuring its own product via Clio + Sonnet-4.6 classifiers, validated against telemetry; excludes headless/SDK/IDE usage). Task value is a deliberately relative proxy — see below.

The nine work modes#

Every session is classified into the single mode that best describes it. The taxonomy itself is worth recording:

Code-direct (≈56%): building new (25%), fixing broken (26%), testing/orchestrating (5%)
Operating (17%): deploying, configuring, running pipelines, monitoring
Plan/explore (14%): understanding an existing system, planning a change
Analysis/prose (13%): analyzing data, communicating via docs/presentations

The classifier agrees with automatic telemetry — >90% of sessions it labeled as creating/modifying code showed actual code changes — which is the validation anchor for trusting the rest of the transcript-based measures.

The shift, quantified (Oct 2025 → Apr 2026)#

Work mode	Direction	Change
Fixing broken code	↓	33% → 19%
Operating software	↑	14% → 21%
Writing + data analysis	↑	~10% → ~20%

And value rose across the board (freelance-marketplace proxy): average session +~27%; building +43%, operating +34%, fixing +32%. The report stresses the dollar figures are coarse and best read as relative movement, not literal value.

The interpretation: less time is going to the reactive, low-leverage work (debugging) and more to proactive, higher-leverage, more autonomous work (operate-the-system, analyze-the-data, write-the-document). The report explicitly frames Claude Code usage as a possible preview of where knowledge work is headed as agents embed in non-coding work — the doubling of writing/analysis is the leading edge of that spread beyond code.

Telemetry, not survey — and the contrast it sets up#

Like Faros, this study reads behavior from system signals (here, session transcripts + automatic telemetry via Clio) rather than from how developers feel. But the two telemetry studies reach opposite-feeling conclusions, and the contrast is instructive:

This study: session value up ~27%, debugging down, success rates up — an optimistic picture at the level of the individual interactive session.
Faros: throughput up but quality down (bugs/dev +54%, incidents/PR +243%, review time 5×) — a pessimistic picture at the level of the organization's SDLC.

These are not a direct contradiction; they measure different units and stages. Anthropic measures did this session accomplish its goal and how valuable was it (a within-session, interactive-usage proxy, with the human in the loop); Faros measures what happens downstream across the whole org's pipeline weeks later (incidents, review queues, reopened tickets). One can be true at the session layer while the other is true at the org layer — "the task got done and felt valuable" and "the cumulative downstream cost is rising" are the felt-productivity-vs-system-outcome split that Telemetry vs. Survey Measurement itself flags. Evidence weight also differs: this is empirical research telemetry; Faros is vendor-claim lead-gen telemetry.

Connections#

Harness Shrinkage as Models Improve — the debugging-share collapse and the move to end-to-end delegation is the shrinking harness read from usage data: as the model improves, less scaffolding/firefighting per task
Acceleration Whiplash — the juxtaposed telemetry: value/success up at the session layer here vs. quality down at the org layer there; different units, both potentially true
Telemetry vs. Survey Measurement — both studies measure behavior not feeling; this is the empirical cousin to Faros's vendor-claim telemetry, and the session-vs-org distinction is the same felt-vs-system split
Returns to Expertise in Agentic Coding — the companion finding from the same study: who succeeds, alongside this what shifts
AI as Primary Author — more end-to-end agentic use (operate/analyze/write) is the usage-side of authorship moving toward the agent
Task Time-Horizon Scaling — the rising capability ceiling is the upstream cause: longer reliable task horizons are what let usage move from fixing to operating/analyzing whole workflows
Claude Code — the product whose usage composition this tracks
Conversation-to-Delegation Shift — the OpenAI/Codex, cross-population telemetry of the same asking→doing move; this page is its Anthropic/Claude-Code, single-tool sibling — two labs, two tools, one shift

Open questions#

The window is seven months and the value proxy is coarse/relative. How much of the +27% is genuine task-complexity growth vs. classifier/marketplace-matching drift?
The study excludes headless/SDK/IDE usage — a "substantial share," and likely the most automated/end-to-end. Does including it accelerate or reverse the composition shift?
If "fixing" keeps falling, is that because models break less, or because broken-code work is migrating to non-interactive pipelines this study doesn't see?

Sources#

Agentic coding and persistent returns to expertise — §"What people use Claude Code for", §"The work"