Plate IIAI Engineering中文HOWARDISM

The Verifiability Thesis

PublishedMay 23, 2026FiledConceptDomainAI EngineeringTagsLLM Architecture LLM Evaluation Agent EngineeringReading5 minSourceAI-synthesised

LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peaks; "verifiable + labs care"; everything eventually verifiable

Illustration for The Verifiability Thesis

Sources#

Andrej Karpathy: From Vibe Coding to Agentic Engineering

Summary#

Andrej Karpathy's organizing claim for what AI automates and when: traditional computers automate what you can specify in code; LLMs automate what you can verify. Because frontier labs train models in giant reinforcement-learning environments with verification rewards, capability peaks sharply in verifiable domains (math, code) and stays rough elsewhere — producing Jagged Intelligence (Ghosts, Not Animals). The practical decomposition: a capability shows up if it is verifiable and the labs cared enough to build environments / include the data. Verifiability is therefore both an explanation of today's jaggedness and a strategy lever — if you can construct verification, you can pull the RL/fine-tuning lever yourself.

The core analogy#

Traditional computers can automate what you can specify in code. This latest round of LLMs can automate what you can verify.

RL training rewards verified outcomes, so the gradient flows hardest toward domains where correctness is checkable. Math and code are the canonical winners — and not coincidentally the domains where AI-Driven Formal Proof Search (Lean + compiler) and agentic coding (tests + CI) are most powerful. The compiler/test is the verifier; the verifier is the reward signal.

"Verifiable + labs care"#

Verifiability alone isn't sufficient — the labs also choose what enters the mix:

The chess anecdote. GPT-3.5→GPT-4 chess improved far more than the general capability curve predicted, because "a huge amount of chess data made it into the pre-training set." Someone decided to add it; the capability spiked. You are "slightly at the mercy of whatever the labs happen to put into the mix."
Implication for users. The model "has no manual." You must explore it: figure out which circuits you're in. "If you're in the circuits that were part of the RL, you fly; if you're out of the data distribution, you struggle" — and then you have to do your own fine-tuning.

The founder lever#

Karpathy's advice to founders chasing verifiable domains the labs haven't prioritized: verifiability is "technology that just works — you can pull a lever." If you can assemble diverse RL environments / examples, you can fine-tune and "get something that actually works pretty well." He coyly declines to name "one domain that's very [valuable]" — a deliberate non-answer that flags an unexploited verifiable niche as a startup opportunity. (Cross-ref the moats discussion in Seven Powers Applied to AI and Compounding Data Moat: a proprietary verification environment is a cornered resource.)

Everything is eventually verifiable#

On the flip side — "what stays automatable only from a distance?" — Karpathy argues almost everything can be made verifiable to some extent. Even soft domains like writing yield to "a council of LLM judges" producing something reasonable. So the question is how easy or hard, not whether. This is the optimistic horizon of the thesis: verifiability is a spectrum that AI keeps climbing, with LLM-judge ensembles extending the reward signal into fuzzy domains.

Connections#

Andrej Karpathy — author of the thesis (his verifiability writing)
Jagged Intelligence (Ghosts, Not Animals) — verifiability is the cause; jaggedness is the symptom
AI-Driven Formal Proof Search — the purest instance: Lean's compiler is the perfect verifier, which is why DeepMind's agents resolve open math problems
Vibe Coding vs. Agentic Engineering — the discipline's hiring test ("red-team can't break it") is verifiability operationalized
Evals as Product Spec — Cat Wu's "ten great evals" is the product-side mirror: encoding what "verified/done" means for an AI feature
Verification as the New Bottleneck — Fiona Fung: once coding is cheap, verification (not generation) is the scarce resource
Scale-Dependent Prompt Sensitivity — verifiable-domain RL is part of why bigger models don't uniformly win on every benchmark
The Bitter Lesson — RL-at-scale in verifiable environments is the general method outrunning hand-crafted heuristics
Client-Side Agent Optimization — fine-tuning on your own RL environments is the heaviest "pull the lever" version of the optimization story
Compounding Data Moat — a proprietary verification environment is a defensible cornered resource

Open Questions#

Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?

Sources#

Andrej Karpathy: From Vibe Coding to Agentic Engineering

§ end

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 24

Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
Andrej Karpathy
Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…
Autonomous Scientific Discovery
Mythos-class models now conduct novel science with limited human input — autonomous protein/drug design (~10× faster, m…
Build for the Next Model
Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…
Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
Compounding Data Moat
Anthropic's prescription for Scale-stage defensibility: time-locked behavioral fingerprint + domain-encoded edge cases…
Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
Jagged Intelligence (Ghosts, Not Animals)
"Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…
Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
AI Engineering & Agent Tooling
Map of Content for the ai-engineering domain — 45 concepts. Curated entry point; see Home for all domains.
Open Questions Backlog
_124 pages with open questions, as of 2026-06-19._
Reward Hacking
The model optimizing the measured proxy (a reward signal, a metric, a grader's judgment, a tool's output) rather than t…
Scale-Dependent Prompt Sensitivity
Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…
Seven Powers Applied to AI
Helmer/Acquired framework re-evaluated for AI: switching costs and process power erode; network effects, scale, cornere…
Software 3.0
Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist";…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
When Does Verification Quality Determine Whether AI Automation Works?
Verification-quality ladder from Lean/formal proof search through software CI and vulnerability reproduction; autonomy…
Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…
Why AI Lags at Design
Andrew Ambrosino's four reasons frontier models are worse at visual/product design than at code: design is hard to grad…

Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…

Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
Agent Harness Engineering
Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…
Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…

Cited by 24

Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
Andrej Karpathy
Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…
Autonomous Scientific Discovery
Mythos-class models now conduct novel science with limited human input — autonomous protein/drug design (~10× faster, m…
Build for the Next Model
Prototype the thing that almost works, not the thing that already works: bet that the next concrete model release (not…
Client-Side Agent Optimization
AgentOpt's framing of developer-controlled agent optimization (model-per-role, budget, routing) as distinct from server…
Compounding Data Moat
Anthropic's prescription for Scale-stage defensibility: time-locked behavioral fingerprint + domain-encoded edge cases…
Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
How Do You Write Evals for Taste? Character as the Limit Case
Taste-driven features are eval-resistant but not eval-proof: the technique is conviction → dogfood-sourced failure sign…
Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
Jagged Intelligence (Ghosts, Not Animals)
"Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…
Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
AI Engineering & Agent Tooling
Map of Content for the ai-engineering domain — 45 concepts. Curated entry point; see Home for all domains.
Open Questions Backlog
_124 pages with open questions, as of 2026-06-19._
Reward Hacking
The model optimizing the measured proxy (a reward signal, a metric, a grader's judgment, a tool's output) rather than t…
Scale-Dependent Prompt Sensitivity
Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…
Seven Powers Applied to AI
Helmer/Acquired framework re-evaluated for AI: switching costs and process power erode; network effects, scale, cornere…
Software 3.0
Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist";…
The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
When Does Verification Quality Determine Whether AI Automation Works?
Verification-quality ladder from Lean/formal proof search through software CI and vulnerability reproduction; autonomy…
Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…
Why AI Lags at Design
Andrew Ambrosino's four reasons frontier models are worse at visual/product design than at code: design is hard to grad…