H
Howardismvol. 03 · quiet corner of the web
Plate IIArchitectureHOWARDISM

The Verifiability Thesis

PublishedMay 23, 2026FiledConceptTopicArchitectureTagsLLM ArchitectureLLM EvaluationAgent EngineeringReading5 minSourceAI-synthesised

LLMs automate what you can *verify* as computers automate what you can *specify*; RL verification rewards → jagged peaks; "verifiable + labs care"; everything eventually verifiable

Illustration for The Verifiability Thesis

Sources#

Summary#

Andrej Karpathy's organizing claim for what AI automates and when: traditional computers automate what you can specify in code; LLMs automate what you can verify. Because frontier labs train models in giant reinforcement-learning environments with verification rewards, capability peaks sharply in verifiable domains (math, code) and stays rough elsewhere — producing Jagged Intelligence (Ghosts, Not Animals). The practical decomposition: a capability shows up if it is verifiable and the labs cared enough to build environments / include the data. Verifiability is therefore both an explanation of today's jaggedness and a strategy lever — if you can construct verification, you can pull the RL/fine-tuning lever yourself.

The core analogy#

Traditional computers can automate what you can specify in code. This latest round of LLMs can automate what you can verify.

RL training rewards verified outcomes, so the gradient flows hardest toward domains where correctness is checkable. Math and code are the canonical winners — and not coincidentally the domains where AI-Driven Formal Proof Search (Lean + compiler) and agentic coding (tests + CI) are most powerful. The compiler/test is the verifier; the verifier is the reward signal.

"Verifiable + labs care"#

Verifiability alone isn't sufficient — the labs also choose what enters the mix:

  • The chess anecdote. GPT-3.5→GPT-4 chess improved far more than the general capability curve predicted, because "a huge amount of chess data made it into the pre-training set." Someone decided to add it; the capability spiked. You are "slightly at the mercy of whatever the labs happen to put into the mix."
  • Implication for users. The model "has no manual." You must explore it: figure out which circuits you're in. "If you're in the circuits that were part of the RL, you fly; if you're out of the data distribution, you struggle" — and then you have to do your own fine-tuning.

The founder lever#

Karpathy's advice to founders chasing verifiable domains the labs haven't prioritized: verifiability is "technology that just works — you can pull a lever." If you can assemble diverse RL environments / examples, you can fine-tune and "get something that actually works pretty well." He coyly declines to name "one domain that's very [valuable]" — a deliberate non-answer that flags an unexploited verifiable niche as a startup opportunity. (Cross-ref the moats discussion in Seven Powers Applied to AI and Compounding Data Moat: a proprietary verification environment is a cornered resource.)

Everything is eventually verifiable#

On the flip side — "what stays automatable only from a distance?" — Karpathy argues almost everything can be made verifiable to some extent. Even soft domains like writing yield to "a council of LLM judges" producing something reasonable. So the question is how easy or hard, not whether. This is the optimistic horizon of the thesis: verifiability is a spectrum that AI keeps climbing, with LLM-judge ensembles extending the reward signal into fuzzy domains.

Connections#

Open Questions#

  • Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
  • The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?

Sources#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 13
  • Agentic Loops Overtake Bespoke Systems

    DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…

  • AI-Driven Formal Proof Search

    LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…

  • AlphaProof Nexus

    DeepMind framework for LLM-aided Lean proof generation; four agents (basic→full-featured); proof-sketch + EVOLVE-BLOCK…

  • Andrej Karpathy

    Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…

  • Evals as Product Spec

    Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done loo…

  • Evolutionary Proof Search

    The full-featured agent's mechanism: population DB of proof sketches, Elo via Plackett–Luce/Gibbs, P-UCB selection, LLM…

  • Jagged Intelligence (Ghosts, Not Animals)

    "Ghosts not animals": jagged statistical circuits, no intrinsic motivation; car-wash/strawberry failures; stay in the l…

  • Lean

    Proof assistant whose compiler mechanically verifies every step; the `sorry` placeholder enables proof sketches; mathli…

  • Scale-Dependent Prompt Sensitivity

    Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…

  • Software 3.0

    Karpathy's taxonomy: 1.0 code, 2.0 weights, 3.0 prompting; LLM as programmable interpreter; MenuGen "shouldn't exist";…

  • The Bitter Lesson

    Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…

  • Verification as the New Bottleneck

    Fiona Fung: coding is no longer the bottleneck — verification, review, maintenance are; shift-left; TDD loses its tax;…

  • Vibe Coding vs. Agentic Engineering

    Vibe coding raises the floor (anyone builds); agentic engineering preserves the quality bar while going faster; ">10x a…

Related articles
  • Harness Shrinkage as Models Improve

    Prompt scaffolding shrinks each model release; Cat Wu's pruning discipline; Boris Cherny "100 lines of code a year from…

  • Claude Code

    Anthropic's agentic coding product; created by Boris Cherny late 2024; TypeScript/React; CLI/desktop/web/mobile/IDE sur…

  • AI-Driven Formal Proof Search

    LLM generates Lean, compiler verifies every step → eliminates hallucination; DeepMind resolves 9/353 Erdős + 44/492 OEI…

  • Andrej Karpathy

    Co-founder OpenAI, ex-Tesla AI, Eureka Labs; coined "vibe coding," Software 1/2/3.0, "ghosts not animals," "agentic eng…

  • Agentic Loops Overtake Bespoke Systems

    DeepMind's *basic* Ralph-loop agent matched its bespoke evolutionary+AlphaProof system as the LLM improved; the bitter…