Howardism
Plate II · Interaction & Multimodal

Why AI Lags at Design

Published: July 3, 2026
Filed: Concept
Domain: Interaction & Multimodal
Tags: Design, Model Capability, Verifiability, Interaction Multimodal
Reading: 6 min
Source: AI-synthesised

Andrew Ambrosino's four reasons frontier models are worse at visual/product design than at code: design is hard to grade (no clean reward like 'does it compile'), it sat outside the AI-research flywheel that labs optimized for, it rewards novelty where code rewards known patterns, and it hides a design↔code abstraction layer (a rebrand is 263 components on the surface, semantic relationships underneath).

Summary

Andrew Ambrosino (OpenAI Codex) answers a question the wiki keeps circling — why is "this looks like AI design" still a putdown while AI writes production code? — with four reasons frontier models trail at visual/product design, two practical (and fading) and two harder. It's a sharp, first-hand articulation of where the verifiable-reward frontier stops: code has a clean grader ("does it compile, does it do what it's supposed to"); design's grader is human taste, which is expensive to put in a training loop.

Evidence note. Practitioner opinion: an OpenAI product leader's read, explicitly hedged ("I'm not in our research… I'll get yelled at for saying this"), not a research claim.

The four reasons

1. Design is hard to grade (the load-bearing one). "Creating a loop where you can train the model on what's good design and what's bad design is more tedious and onerous than 'does the code compile.'" Code carries its own verifier; design's verifier is the human aspect of taste, "part of the feedback mechanism you need." This is the verifiability thesis stated from the design side: capability advances fastest where reward is cheap and objective, and design's reward is neither.
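The asymmetry between the two graders can be sketched in a few lines. This is an illustrative sketch only, not anything Ambrosino or OpenAI describes: `code_reward` uses Python's built-in `compile` as a stand-in for "does the code compile", while `design_reward` and its `raters` parameter are hypothetical names standing in for the human taste judgments a design training loop would need.

```python
from typing import Callable, Sequence


def code_reward(source: str) -> float:
    """Cheap, objective grader: 1.0 if the source parses and compiles, else 0.0."""
    try:
        compile(source, "<candidate>", "exec")  # built-in syntax check, nothing is executed
        return 1.0
    except SyntaxError:
        return 0.0


def design_reward(
    candidate: str,
    baseline: str,
    raters: Sequence[Callable[[str, str], bool]],
) -> float:
    """Hypothetical taste grader: share of human raters who prefer candidate to baseline.

    Each rater is assumed to be a person answering "is the first design better?".
    """
    if not raters:
        raise ValueError("no raters: design has no grader without human judgment")
    votes = sum(1 for prefers in raters if prefers(candidate, baseline))
    return votes / len(raters)
```

The first function runs for free on every sample; the second cannot return anything until people have looked at the designs, which is the cost the quote calls "tedious and onerous".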

2. It sat outside the AI-research flywheel. "Labs historically invest in making their models good at things that accelerate AI research." In the early coding-model era it was obvious that a model writing correct code would accelerate research; "you can't really make the same case for design." So design got less deliberate investment — not because it doesn't matter, but because it isn't in the self-improvement loop. (Practical; Ambrosino expects it to fade — "these models will get pretty good at design.")

3. Design rewards novelty; code rewards known patterns. "In software engineering you almost want it to over-index on known patterns." In design you don't: "there's an element of randomness and novelty." His example: for a year every new website was a copy of Linear's — "if a model outputs Linear's website every time, that's not the challenge here." Regression-to-the-mean is a feature for code and a failure for design. (Cf. Transformative Creativity / the abstraction-barrier critique: novel-concept generation may be a real ceiling, not just the next capability to fall.)

4. The design↔code abstraction layer (the deep one). Being a better visual designer isn't sufficient; there's "an interplay between the software design and the code being written" that is "visual design but significantly deeper — it's about the abstractions." His rebrand thought-experiment makes it concrete:

  • Shallow version: "we have to update 263 components one by one."
  • Deep version: understanding that "these two things look different but they're both in lists that have this style that conveys this interaction pattern to the user" — the semantic relationships between elements, not their pixels.

"That is still feeling a little bit out of reach with the current technology." This is the design-system-as-semantic-layer problem: real design competence lives in the maintainable abstraction between look and code, exactly the layer models are weakest on.
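The shallow and deep versions of the rebrand can be contrasted in a small sketch. Everything here is an assumption made for illustration: the component names, the role name `list.selectable.accent`, and the colour values are hypothetical, and two components stand in for the 263. The point is only that once components reference a shared semantic role, a rebrand becomes one edit to the role rather than one edit per component.

```python
from dataclasses import dataclass

# Shallow representation: each component hard-codes its own accent colour.
# Component names and colour values are hypothetical.
SHALLOW_STYLES = {"InboxRow": "#3344CC", "ProjectRow": "#3344CC"}


@dataclass(frozen=True)
class Component:
    name: str
    role: str  # semantic role: the interaction pattern the styling conveys


# Deep representation: components point at a shared role; the role maps to a value.
TOKENS = {"list.selectable.accent": "#3344CC"}
COMPONENTS = [
    Component("InboxRow", "list.selectable.accent"),
    Component("ProjectRow", "list.selectable.accent"),
]


def resolve(components, tokens):
    """Compute each component's concrete colour from its semantic role."""
    return {c.name: tokens[c.role] for c in components}


def rebrand_shallow(styles, new_value):
    """Touch every component one by one; returns (new styles, number of edits)."""
    updated = {name: new_value for name in styles}
    return updated, len(updated)


def rebrand_semantic(tokens, role, new_value):
    """Change the role once; returns (new tokens, number of edits)."""
    updated = {**tokens, role: new_value}
    return updated, 1
```

Both paths end at the same pixels, but only the second records why the two rows look alike. Recovering that `role` field from components that merely look similar is the part the quote describes as still out of reach.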

Why it matters

If reasons 1–2 are practical and fading but 3–4 are structural, then design will remain a pocket of human taste for longer than code did: the human "feedback mechanism" is not just labeling data, it is the reward function itself. It is also a caution against reading model-produced polish as competence: a model can emit a production-looking surface (reason 3's mean-reversion to "good-looking") while missing the semantic abstraction (reason 4) that makes the design actually maintainable.

Open Questions

  • Are reasons 3–4 (novelty, the abstraction layer) genuine ceilings, or — like reasons 1–2 — just under-invested capabilities that fall once a lab builds the grader?
  • Can design be made gradable without a human in the loop (learned taste models, preference data at scale), or does the "human aspect of taste" resist automation the way research taste might?
  • Does the design↔code abstraction layer improve with better code-understanding models even if pure visual design stalls — i.e. is reason 4 a coding-capability problem in disguise?

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 10
  • Andrew Ambrosino

    Product & engineering lead for the Codex desktop app at OpenAI; a designer→engineer→PM→founder generalist whose June 20…

  • Build for the Next Model

    Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…

  • Codex

    OpenAI's agentic coding and work platform: a CLI (April 2025) plus a desktop app (built Nov 2025, released Feb 2026) bu…

  • Jagged Intelligence (Ghosts, Not Animals)

    "Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…

  • Living Design System

    `design_system.html` extracted from repos as a portable, human- and machine-readable source of truth; component playgro…

  • Interaction & Multimodal

    Map of Content for the interaction-multimodal domain — 7 concepts. Curated entry point; see Home for all domains.

  • Polish No Longer Signals Readiness

    Andrew Ambrosino's observation that the medium used to encode process-stage — a production-looking artifact meant late-…

  • Research Taste as the Human Bottleneck

    The narrowing human role as AI absorbs execution: choosing which problems matter, which results to trust, and when an a…

  • The Bitter Lesson

    Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…

  • Transformative Creativity

    Boden's three-level model of creativity (combinational, exploratory, transformative) used to locate today's AI achievem…
