Sources#
Summary#
Andrej Karpathy's organizing claim for what AI automates and when: traditional computers automate what you can specify in code; LLMs automate what you can verify. Because frontier labs train models in giant reinforcement-learning environments with verification rewards, capability peaks sharply in verifiable domains (math, code) and stays rough elsewhere — producing Jagged Intelligence (Ghosts, Not Animals). The practical decomposition: a capability shows up if it is verifiable and the labs cared enough to build environments / include the data. Verifiability is therefore both an explanation of today's jaggedness and a strategy lever — if you can construct verification, you can pull the RL/fine-tuning lever yourself.
The core analogy#
Traditional computers can automate what you can specify in code. This latest round of LLMs can automate what you can verify.
RL training rewards verified outcomes, so the gradient flows hardest toward domains where correctness is checkable. Math and code are the canonical winners — and not coincidentally the domains where AI-Driven Formal Proof Search (Lean + compiler) and agentic coding (tests + CI) are most powerful. The compiler/test is the verifier; the verifier is the reward signal.
"Verifiable + labs care"#
Verifiability alone isn't sufficient — the labs also choose what enters the mix:
- The chess anecdote. GPT-3.5→GPT-4 chess improved far more than the general capability curve predicted, because "a huge amount of chess data made it into the pre-training set." Someone decided to add it; the capability spiked. You are "slightly at the mercy of whatever the labs happen to put into the mix."
- Implication for users. The model "has no manual." You must explore it: figure out which circuits you're in. "If you're in the circuits that were part of the RL, you fly; if you're out of the data distribution, you struggle" — and then you have to do your own fine-tuning.
The founder lever#
Karpathy's advice to founders chasing verifiable domains the labs haven't prioritized: verifiability is "technology that just works — you can pull a lever." If you can assemble diverse RL environments / examples, you can fine-tune and "get something that actually works pretty well." He coyly declines to name "one domain that's very [valuable]" — a deliberate non-answer that flags an unexploited verifiable niche as a startup opportunity. (Cross-ref the moats discussion in Seven Powers Applied to AI and Compounding Data Moat: a proprietary verification environment is a cornered resource.)
Everything is eventually verifiable#
On the flip side — "what stays automatable only from a distance?" — Karpathy argues almost everything can be made verifiable to some extent. Even soft domains like writing yield to "a council of LLM judges" producing something reasonable. So the question is how easy or hard, not whether. This is the optimistic horizon of the thesis: verifiability is a spectrum that AI keeps climbing, with LLM-judge ensembles extending the reward signal into fuzzy domains.
Connections#
- Andrej Karpathy — author of the thesis (his verifiability writing)
- Jagged Intelligence (Ghosts, Not Animals) — verifiability is the cause; jaggedness is the symptom
- AI-Driven Formal Proof Search — the purest instance: Lean's compiler is the perfect verifier, which is why DeepMind's agents resolve open math problems
- Vibe Coding vs. Agentic Engineering — the discipline's hiring test ("red-team can't break it") is verifiability operationalized
- Evals as Product Spec — Cat Wu's "ten great evals" is the product-side mirror: encoding what "verified/done" means for an AI feature
- Verification as the New Bottleneck — Fiona Fung: once coding is cheap, verification (not generation) is the scarce resource
- Scale-Dependent Prompt Sensitivity — verifiable-domain RL is part of why bigger models don't uniformly win on every benchmark
- The Bitter Lesson — RL-at-scale in verifiable environments is the general method outrunning hand-crafted heuristics
- Client-Side Agent Optimization — fine-tuning on your own RL environments is the heaviest "pull the lever" version of the optimization story
- Compounding Data Moat — a proprietary verification environment is a defensible cornered resource
Open Questions#
- Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
- The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?
Sources#
Cited by 13
- Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- AlphaProof Nexus
DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…
- Andrej Karpathy
Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…
- Evals as Product Spec
Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…
- Evolutionary Proof Search
The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…
- Jagged Intelligence (Ghosts, Not Animals)
"Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…
- Lean
Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…
- Scale-Dependent Prompt Sensitivity
Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…
- Software 3.0
Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist";…
- The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
- Verification as the New Bottleneck
Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…
- Vibe Coding vs. Agentic Engineering
Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…
Related articles
- Harness Shrinkage as Models Improve
Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…
- Claude Code
Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…
- AI-Driven Formal Proof Search
LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…
- Andrej Karpathy
Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…
- Agentic Loops Overtake Bespoke Systems
DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…
