
Build for the Next Model

Published: June 7, 2026 · Filed: Concept · Domain: Agent Systems · Tags: AI Coding Workflow, Product Strategy, Model Improvement · Reading: 10 min · Source: AI-synthesised

Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not a far-future AGI) fixes what your engineering can't. Claude Design's Opus 4.7 payoff and OpenAI's "the February Codex app would have failed in November" are the cleanest cases: same product shape, different-intelligence release, different outcome.



Summary

A product-strategy corollary of Harness Shrinkage as Models Improve, now stated independently by three Anthropic voices: don't build the thing that already works — prototype the thing that almost works, and bet that the next model release closes the gap. Dan Carey gives it the cleanest case: Claude Design shipped with a list of problems the team "did not fix with clever engineering… we fixed them with Opus 4.7 coming out." Boris Cherny built Claude Code knowing "it wouldn't have PMF for 6 months because we were building for the next model." Cat Wu frames the discipline as "build products that don't necessarily work yet so that you know what is missing… and then with the newest model you can just swap it in." Because models improve rapidly, engineering effort spent forcing today's model to do what next quarter's model will do for free is wasted — "the model releases are a tide that lifts all boats."

The Carey statement (and why it's the clearest)

"You do not want to work on the thing that already works. You often want to prototype the thing that almost works… The next model may just fix the issues that you cannot solve via engineering. We had this with Claude Design… We fixed them with Opus 4.7 coming out."

This is the rare retrospective, concrete confirmation of the bet: a named product (Claude Design), a named model (Claude Opus 4.7), and a specific outcome (unsolved prototype gaps closed by the release rather than by engineering). Boris and Cat state the strategy prospectively; Carey shows it paying off.

The crucial calibration: next model, not a strawman AGI

The bet is easy to misread as "build for some imagined super-AI." Cat Wu guards exactly against that — her stance recorded on her entity page is "build for the current model": "It's very easy to build the product for the super-AGI strong model. The hard thing is figuring out for the current model, how do you elicit the maximum capability?" These reconcile into one rule:

Don't build for today's model only → you under-shoot, and ship something that's obsolete the moment the next release lands.
Don't build for a far-future AGI strawman → you over-shoot, and ship vaporware that depends on capability nobody has.
Build for the next concrete release (roughly the model six months out) → you prototype "the thing that almost works," ship it as a research preview, and let the next release, which you can reasonably forecast, close the gap.
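The three-way rule above can be read as a comparison between the capability a product needs and the capability available now versus at the next release. The sketch below is only an illustration of that reading: the function name `calibrate`, the idea of a single numeric capability score, and the example numbers are all assumptions, not anything the speakers describe.

```python
def calibrate(required: float, current: float, forecast_next: float) -> str:
    """Classify a product bet by the capability it needs.

    All three arguments are hypothetical scores on the same made-up scale:
    what the product needs, what today's model delivers, and what the next
    concrete release is forecast to deliver.
    """
    if forecast_next < current:
        raise ValueError("forecast for the next release should not be below current capability")
    if required <= current:
        # The thing that already works: nothing left for the next release to close.
        return "under-shoot: already works on today's model"
    if required <= forecast_next:
        # The thing that almost works: a forecastable release closes the gap.
        return "on target: almost works, next release should close the gap"
    # Needs capability beyond anything that can reasonably be forecast.
    return "over-shoot: depends on capability nobody has"


if __name__ == "__main__":
    for needed in (0.5, 0.7, 0.95):
        print(needed, "->", calibrate(needed, current=0.6, forecast_next=0.75))
```

In practice nobody has a single capability number to compare against; the point of the sketch is only that the rule has two failure modes, one on each side of the next release.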

Carey names the target the prototype is reaching for: not completeness but "that hint of magic… something that could become [complete] in the future."

The OpenAI-side confirmation: same shape, different intelligence (Ambrosino)

Andrew Ambrosino supplies the second concrete, retrospective case — and the sharpest formulation of the bet. His claim about the Codex app:

"I am very confident that the Codex app we released in February, if it had been ready in November, would have absolutely failed in the market — the only difference was the models between November and February. The exact same shape… its outcomes were totally different depending on just a few months of timing."

He generalizes it into a "same feature, different intelligence, re-release" pattern: Operator (in ChatGPT) → agent mode in Atlas → the in-app browser in Codex are "fundamentally the same feature," and "you might need to release this thing six different times before it works; the shape might not change at all." What changes is the model underneath. So he coaches his team not to be stubborn: "no, this isn't working, so it's a bad feature" is the wrong read, and "it might not be ready yet" is the right one. On that reading, the working build is an artifact to test against future models, not something to ship yet.

The over-shoot he warns about: "too AGI-pilled for the moment." Ambrosino names the failure mode on the other side of the calibration (the AGI strawman the calibration section above warns against) with an unusually candid cross-vendor example. The original Codex web release "gave the model a task and it went off and did it," a fully delegated, AGI-shaped form factor, but "the model didn't do the task that well." Meanwhile Claude Code came out "totally local, not hooked up to the cloud… doesn't pretend to be as AGI-pilled: it asks you questions, you can't just delegate your life to it," and it "worked way better because that's the point the models were at." His lesson: "we were too AGI-pilled for the moment." The bet is calibrated to where the models actually are, and matching the interaction shape to current capability can beat betting on delegation the model can't yet support. (This is the same product-fit gap Interaction Models describes from the interaction-design side.)

Planning under model uncertainty (the corollary)

The same logic reshapes roadmaps. Ambrosino: "the shorter-term something is, the more detail it needs" — but a 9-month plan "has to stay very hazy, because any precision you add is false precision, and you're just going to waste time." Anything planned in November "may have been true for December but isn't what happened." So planning becomes forecasting model capability on a timeline: at his last company the process became "list everything we're interested in, prototype all of them, decide which are ready now, let the others sit and bake, and every time there's a new leap in models, try that thing again with it swapped out — because whether features were good was based on whether the model was smart enough, not the shape of them." That is planning-minimization driven specifically by capability uncertainty.
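The process Ambrosino describes (prototype everything, ship what is ready, park the rest, retry on every model leap) can be sketched as a small loop. This is a minimal illustration, not his team's tooling: `Prototype`, `retry_backlog`, the `ready_threshold` of 0.8, the model names, and the scores are all hypothetical, and the `evaluate` callable stands in for whatever evaluation a team actually runs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Prototype:
    name: str
    # Hypothetical evaluation hook: takes a model identifier, returns a score in [0, 1].
    evaluate: Callable[[str], float]
    ready_threshold: float = 0.8  # assumed cut-off, not a figure from the source
    history: dict[str, float] = field(default_factory=dict)


def retry_backlog(
    backlog: list[Prototype], model: str
) -> tuple[list[Prototype], list[Prototype]]:
    """Re-run every parked prototype against `model`; split into ready and still baking."""
    ready: list[Prototype] = []
    baking: list[Prototype] = []
    for proto in backlog:
        score = proto.evaluate(model)
        proto.history[model] = score  # keep the record so each leap can be compared
        (ready if score >= proto.ready_threshold else baking).append(proto)
    return ready, baking


# Made-up scores standing in for a real evaluation suite.
FAKE_SCORES = {
    ("in-app browser", "model-nov"): 0.55,
    ("in-app browser", "model-feb"): 0.86,
    ("slide designer", "model-nov"): 0.40,
    ("slide designer", "model-feb"): 0.62,
}


def make_eval(name: str) -> Callable[[str], float]:
    return lambda model: FAKE_SCORES[(name, model)]


backlog = [Prototype(n, make_eval(n)) for n in ("in-app browser", "slide designer")]
for model in ("model-nov", "model-feb"):
    # The prototypes themselves never change; only the model underneath does.
    ready, backlog = retry_backlog(backlog, model)
    print(model, "ready:", [p.name for p in ready], "baking:", [p.name for p in backlog])
# model-nov ready: [] baking: ['in-app browser', 'slide designer']
# model-feb ready: ['in-app browser'] baking: ['slide designer']
```

The design choice worth noticing is that the backlog stores prototypes, not decisions: a feature that scores badly stays in the list and is tried again, rather than being cut as a bad idea.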

Why this follows from the bitter lesson

This is the product-side expression of The Bitter Lesson and Harness Shrinkage as Models Improve: capability migrates into the model over releases, so scaffolding built to compensate for a current limitation is a depreciating asset. If a gap is the kind that scales away (reasoning, instruction-following, multimodal fidelity), patching it with engineering is building a crutch you'll soon delete. The discipline is to identify which gaps are "wait for the model" gaps versus which are durable harness work (Harness Shrinkage as Models Improve's caveat: mechanical verification, security, brand/character don't migrate inward).
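As a rough aid, the split described above can be written down as a lookup. The two category sets are taken directly from the paragraph; everything else (the names `GapKind` and `triage`, the example gaps, and the third "needs judgment" bucket for categories the text does not mention) is an assumption added for illustration.

```python
from enum import Enum


class GapKind(Enum):
    WAIT_FOR_MODEL = "wait for the model"
    DURABLE_HARNESS = "durable harness work"
    NEEDS_JUDGMENT = "not covered by either list"


# Categories the text names as the kind that scale away with model releases.
SCALES_AWAY = frozenset({"reasoning", "instruction-following", "multimodal fidelity"})
# Categories the text names as not migrating into the model.
DURABLE = frozenset({"mechanical verification", "security", "brand/character"})


def triage(category: str) -> GapKind:
    """Map a gap's category to the kind of work it calls for."""
    key = category.strip().lower()
    if key in SCALES_AWAY:
        return GapKind.WAIT_FOR_MODEL
    if key in DURABLE:
        return GapKind.DURABLE_HARNESS
    return GapKind.NEEDS_JUDGMENT


# Hypothetical gaps from an imagined prototype review.
gaps = {
    "ignores parts of long style guides": "instruction-following",
    "tool output is not sandboxed": "security",
    "layouts feel generic": "visual taste",
}
for gap, category in gaps.items():
    print(f"{gap}: {triage(category).value}")
```

The third bucket matters most: a category outside both lists is exactly where the wait-or-build decision is a judgment call rather than a lookup.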

The tension to hold

"Prototype the thing that almost works" is in direct tension with Problem-Solution Fit Discipline's prototype-as-evidence trap: a fast prototype proves the build was tractable, not that the problem is real. The reconciliation: build-for-the-next-model addresses capability risk (will the tech get there? The bet is yes, so wait for it), not market risk (does anyone want this? The prototype doesn't answer that). You still validate demand through users; you just don't burn engineering forcing a capability the next model will hand you. Carey's own safeguard is that the bet rides on top of Compounding Loop Optimization and daily user contact: the "shape of the product" is validated continuously even while specific capability gaps are left for the model to close.

Connections

Harness Shrinkage as Models Improve — the parent thesis; this is its product-strategy corollary, and that page's "Build for the next model" section points here
The Bitter Lesson — the root principle: capability migrates inward, so compensating scaffolding depreciates
Claude Opus 4.7 — the concrete model release that closed Claude Design's unsolved gaps
Claude Design — the case study product
Dan Carey — the retrospective statement; Boris Cherny and Cat Wu state it prospectively
Prototype Over PRD — how you author the "almost works" bet quickly
Compounding Loop Optimization — the loop that validates product shape while capability gaps wait for the model
Problem-Solution Fit Discipline — the counter-discipline: don't let "it almost works" become "the prototype validates the idea"
The Verifiability Thesis — what the next model reliably improves are verifiable-reward capabilities; gaps in non-verifiable taste are the riskier ones to "wait out"
Andrew Ambrosino / Codex — the OpenAI-side retrospective (Codex Feb-vs-Nov) and the "same feature, different intelligence, re-release six times" formulation
Polish No Longer Signals Readiness — "it might not be ready yet" relabels a working build as an artifact-to-test, not a shippable — the same stage-signal correction
Interaction Models — matching the interaction shape to current capability (Codex-web "too AGI-pilled" vs. Claude Code's local, question-asking form) is this bet made at the interaction-design layer
Why AI Lags at Design — design as a capability Ambrosino expects the next models to close, the archetypal "wait for the model" gap
Latent Capability Overhang — the pessimistic twin: "wait for the next model" (a capability's cost drops 10–100× per release, so mine it later and cheaper) is the flip side of "build for the next model"; both bet on the forecastable release cadence

Open Questions

How do you tell a "wait for the model" gap from a durable-harness gap before the next release? Get it wrong and you either ship vaporware or build a crutch you'll delete.
The bet depends on a reliable release cadence and a forecastable capability curve (Task Time-Horizon Scaling). What happens to "build for the next model" if model improvement stalls (the stalled-but-diffused future)?
Does the strategy generalize outside frontier labs, who have privileged visibility into the next model? An external team is betting on a release it can't see.

Sources

Designing with Claude: From prompt to production — Carey: "we fixed them with Opus 4.7 coming out"
Anthropic's Boris Cherny: Why Coding Is Solved, and What Comes Next — Boris: "building for the next model"
How Anthropic's product team moves faster than anyone else | Cat Wu (Head of Product, Claude Code) — Cat: "build products that don't necessarily work yet"
OpenAI Codex lead on the new shape of product work — Ambrosino: "the February app would have failed in November — the only difference was the models"; "too AGI-pilled for the moment"


About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 19

Andrew Ambrosino
Product & engineering lead for the Codex desktop app at OpenAI; a designer→engineer→PM→founder generalist whose June 20…
Anthropic Labs
Anthropic's internal incubator — a 'bet factory' of ~a dozen tiny teams exploring the model frontier with lean-startup…
Claude Design
Anthropic Labs product (research preview, ~April 2026) for collaborating with Claude on polished visual artifacts — des…
Claude Opus 4.7
GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…
Codex
OpenAI's agentic coding and work platform: a CLI (April 2025) plus a desktop app (built Nov 2025, released Feb 2026) bu…
Compounding Loop Optimization
Dan Carey's discipline of instrumenting and automating every recurring step of the build loop — because when internal t…
Dan Carey
Product Manager leading product within Anthropic Labs; led Claude Design; 'Designing with Claude' talk (May 2026); ~two…
Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
Latent Capability Overhang
Noam Brown's claim that already-released models can do far more than anyone has extracted, because nobody spends enough…
Agent Systems & Harness Engineering
Map of Content for the agent-systems domain — 23 concepts. Harness engineering, agent loops and orchestration, context…
Open Questions Backlog
_396 actionable open questions across 155 pages · 79 predictions · 9 notes · 21 in progress · 59 watching (entities), a…
Polish No Longer Signals Readiness
Andrew Ambrosino's observation that the medium used to encode process-stage — a production-looking artifact meant late-…
The PRD-Replacement Spectrum at AI-Native Speed
Four positions (grill-then-PRD → lighter-PRD → build-to-decide → prototype-is-spec) are one spectrum once you decompose…
Problem-Solution Fit Discipline
Idea-stage thesis: three defenses against premature building (time, resources, belief friction) all eroded; AI as devil…
Prototype Over PRD
Dan Carey's prototype-replaces-PRD method: record a why-not-what conversation, transcribe it, hand the transcript to Cl…
Task Time-Horizon Scaling
METR's measure of the task length AI can complete reliably on its own, doubling roughly every 4 months (up from every 7…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Why AI Lags at Design
Andrew Ambrosino's four reasons frontier models are worse at visual/product design than at code: design is hard to grad…

Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Claude Design
Anthropic Labs product (research preview, ~April 2026) for collaborating with Claude on polished visual artifacts — des…
Compounding Loop Optimization
Dan Carey's discipline of instrumenting and automating every recurring step of the build loop — because when internal t…
Open Questions Backlog
_396 actionable open questions across 155 pages · 79 predictions · 9 notes · 21 in progress · 59 watching (entities), a…
