Open Questions Backlog

Generated by _system/lint.py --write-backlog. Do not hand-edit. Harvested from the ## Open Questions section of every concept article. Work these off via /query; answered items get filed into derived.

124 pages with open questions, as of 2026-06-19.

The Abstraction Barrier #

Is the current paradigm of large-scale pretraining on human data fundamentally bounded by human conceptual frameworks, and by how much? (Report open question 1i.)
Does the embodied bottleneck reduce the intelligence-growth rate to empirical-science speed, and can that be modelled?
Can a system be built that does grounded concept discovery from raw sensor data — and is collective ASI a way around an individual cap?

Acceleration Whiplash #

Faros's own deferred question: do the bug/incident increases persist when normalized for PR size, or do larger PRs account for most of the quality deterioration? (If the latter, hard PR-size limits are the highest-leverage fix.)
Code churn +861% is genuinely ambiguous (Faros lists three explanations: rework of AI code, productive legacy refactoring, or accelerated polish). The cross-customer metric can't resolve it — a real gap, not a finding.
How much of the "maturity doesn't protect" claim survives the vendor incentive to argue exactly that (i.e., "your existing practices won't save you — you need our platform")?

Advantages of Digital Intelligence #

Does training on human data suffice to give digital intelligence human-grade abstractions, or does the low embodiment factor cap concept formation? (The crux shared with The Abstraction Barrier.)
What do ASI "societies" actually look like — homogeneous super-collectives, market ecologies, or compute-tethered virtual worlds?

Agent Context Files #

Will the role split converge on Hermes's explicit project/personality separation, or stay folded into a single file as in Claude Code? A separate SOUL.md-style personality layer seems strictly better for multi-project users but adds a file to maintain.
Is there a natural ceiling on the layering (project → workflow → spec → constitution), or does each new autonomy surface spawn another context-file tier?
How should context files and bounded memory files interact when they disagree? Memory is lossy and cache-delayed; the context file is authoritative but static. Which wins, and when?

Agent Harness Engineering #

Does a single general-purpose coding agent outperform a multi-agent architecture with specialized testing, QA, and cleanup agents?
How does architectural coherence evolve over years in a fully agent-generated system?
At what codebase scale does the AGENTS.md-as-table-of-contents approach need to be replaced with more sophisticated context routing?
How generalizable are these web-app-focused findings to other domains (scientific research, financial modeling)?

Agent Identity and Authentication #

Hardware-bound credentials assume attested hardware everywhere agents run, including ephemeral cloud workloads and sub-agents. How does attestation work for short-lived spawned sub-agents that "have up to the same permissions as the parent"?
JIT + ABAC are both labeled "advanced, not easily implemented." Is there a pragmatic Enterprise-tier midpoint, or is the gap from Foundation static roles to Advanced JIT a cliff? Answered: Foundation → Enterprise → Advanced: Is the Agent Access-Control Jump a Cliff? — not a cliff; the Enterprise tier (ABAC + dynamic privilege elevation with return-to-baseline + mTLS + sandboxing) is the deliberate midpoint, and ABAC's "advanced" framing is a source inconsistency (it sits at Enterprise in the tier table). Sub-agent attestation remains open.

Agent Loop Pattern #

When the model schedules its own loops (4.7 behavior), who owns the budget? Boris answered "the model just decides" — but that pushes cost discipline into the model's training, not the harness.
Does a loop with a smart enough model still need a Kanban backlog, or does the model choose its own next task from raw goals?
Loop output review is now Matt Pocock's confessed bottleneck — "we just need to be ready to be doing more code review."

Agent-Native Infrastructure #

Who builds the agent-native rewrite of the long tail of human-facing services — the service owners, or a translation layer (MCP servers, computer-use agents) on top?
Agent-to-agent negotiation needs trust, identity, and accountability primitives that don't exist yet. What's the protocol layer, and who governs it?

Agent Supply Chain Risk #

"AI vendoring" as a standard response inverts decades of "don't reinvent the wheel." How is a model-reimplemented dependency itself verified and maintained — does it just relocate the risk?
The 250-doc backdoor persists through SFT/RLHF. What detection exists for an already-poisoned model you didn't train, short of behavioral red-teaming?

Agentic Coding Work-Composition Shift #

The window is seven months and the value proxy is coarse/relative. How much of the +27% is genuine task-complexity growth vs. classifier/marketplace-matching drift?
The study excludes headless/SDK/IDE usage — a "substantial share," and likely the most automated/end-to-end. Does including it accelerate or reverse the composition shift?
If "fixing" keeps falling, is that because models break less, or because broken-code work is migrating to non-interactive pipelines this study doesn't see?

Agentic Honesty & Diligence #

These are short-context toy evals; the failures show up most in long-context deployments. How much of the gain holds at production context lengths?
Code-summary honesty is tested on off-policy prefilled transcripts. Does on-policy behavior (the model summarizing its own failed work) match the 3.7% figure?
Can a diligence eval distinguish genuine honesty from a grader-aware model producing honest-looking output? (The training-gaming gap.)

Agentic Loops Overtake Bespoke Systems #

The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on any problems, or fully collapse to a cost line?
Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?

Agentic Prompt Injection #

Spotlighting and constitutional classifiers each leave a residual (2%, 5%). Stacked, what's the realistic floor, and does it hold against adaptive attackers who know both are deployed? (Partly answered by the Opus 4.8 live bug bounty: adaptive expert red-teamers still find attacks on the bare model; deployed probes add uplift but don't zero out the residual.)
Why did Opus 4.8 regress on prompt-injection robustness relative to Opus 4.7 despite broad alignment gains — a capability/robustness tradeoff, or an artifact of harder adaptive evaluation?
"LLMs cannot reliably distinguish information from instructions" — is this a fundamental property of the architecture or a training gap that future models close? The framework treats it as durable.

Agentic Technical Debt #

How long does a CLAUDE.md remain accurate as a codebase evolves? The playbook gestures at session-by-session updates; no data on rot rate.
The remedy assumes the founder is able to articulate architecture in plain language. Non-technical founders (the playbook's headline beneficiary group) may have neither the vocabulary nor the intuition to do this well — a recursion failure the playbook doesn't address.
Anthropic's harness-shrinkage thesis suggests CLAUDE.md may eventually be inferred by the model itself. Until then, the discipline is load-bearing.

AGI-to-ASI Pathways #

For each friction: is it a fundamental blocker (multi-year plateau) or a mere friction (slows, doesn't halt)? The report's central unresolved question. Synthesized with Anthropic: RSI Growth Curves: Which Friction Binds First? — data-wall and research-gets-harder demote themselves into compute; economics and neural-paradigm are pathway-conditional; the abstraction barrier is the candidate fundamental (re-pacing) blocker; and deliberate slowdown is the only exogenous friction — the one Anthropic wants to install and this report doubts can be made to bind.
Do the four pathways compound multiplicatively when run in parallel, and how would we detect that early?
Can benchmarking methodology that doesn't saturate at human level be built before it's needed for ASI?

AI-Accelerated Offense #

Anthropic argues LLMs benefit defenders more long-term (like fuzzers) but attackers more short-term during the transition. How long is the transition, and what determines who wins it?
"Fundamentals strong enough that scanning finds fewer bugs" assumes defenders run the scanners first. What happens to organizations that can't afford continuous model-driven scanning?

AI Accelerating AI Development #

LOC, self-reports, and headroom-dependent multiples all overstate; what unbiased throughput metric would Anthropic's promised shift to "direct measurement of AI R&D acceleration and researcher uplift" (AI R&D Autonomy Evaluation (AECI)) actually use?
The W2S result didn't transfer to production-scale models. Is that a temporary scaling artifact or a structural limit on autonomous research?
The next-step judgment trend (51%→64%) is measured only on weak-human-move slices. What does the curve look like on a representative sample of research decisions?

AI as Primary Author #

The 60% figure aggregates very different tools and modes (autocomplete acceptance vs. agent-applied diffs). What does "acceptance" mean when the agent applies the change directly and the human's "acceptance" is not reverting it?
If agentic authoring crosses from <1% toward double digits, does the whiplash become unmanageable before context-engine tooling matures — or does the tooling mature because of the pressure?

AI-Driven Formal Proof Search #

Successes cluster where Lean's mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing new theory?
The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
The Graffiti result hints at closing the loop between AI conjecturing and AI proving. What does an end-to-end conjecture→formalize→prove pipeline look like?

AI Native Product Cadence #

Does the cadence scale beyond ~100 people? Anthropic itself is bigger (~30-40 PMs alone), but the Claude Code team that visibly drives cadence is small.
What's the equivalent of research-preview branding for B2B enterprise launches where customers expect stability? Cat doesn't address.
How much of the cadence is structural (process choices) vs cultural (talent density)? Probably both, ratio unclear.

The AI-Native Safe-Choice Inversion #

The inversion is a one-time repricing of "safe." Once several AI-native ERPs exist, does "safe" re-stabilize around the largest AI-native vendor — and does Campfire's "we're now the largest of the new cohort" claim reflect a land-grab for that position?
How long until incumbents bolt on credible AI and neutralize the counter-positioning — and does the custom-foundation-model claim actually defend against that?

AI-Native Startup Lifecycle #

The playbook gives no quantitative evidence for the headcount/capital compression claims (no median time-to-PMF, no headcount-at-PMF numbers, no failure-rate data). The "lean 10-person unicorn" is asserted as deliberate target without case-study evidence in the doc itself.
Founder stories in the resources section (Carta Healthcare, Anything, Cogent, Airtree, Duvo, Zingage, Kindora, Wordsmith) are short callouts — none have published outcomes or comparable-baseline data.
The 42% "built-something-nobody-wanted" CB Insights figure is from a pre-AI era; the playbook predicts the rate will climb but doesn't cite a 2026 measurement.
Tension with HBR's accountability findings (above) is unresolved. The playbook's orchestration framing reads as the exact framing HBR's experimental conditions tested against.

AI R&D Autonomy Evaluation (AECI)#

"Not close to substituting for senior researchers" is a subjective, internally-sourced judgment. What objective signal would replace it as models approach the threshold?
AECI is a single scalar fork of an external index; how sensitive is the 155.5 / frontier-not-advanced conclusion to the choice of the n=11 evaluation set?
The shift to "direct measurement of AI R&D acceleration and researcher uplift" is announced but not yet operationalized in this card — what does that measurement look like?

AlphaProof Nexus #

The framework's reach is gated by Lean's mathlib maturity. What's the path to domains needing new theory rather than subgoal decomposition?
AlphaProof adds little as a soloist but helps as a tool. As the prover LLM strengthens, does the AlphaProof tool become redundant entirely?

Anthropic Institute #

How does the Institute's policy posture (favoring an option to pause) interact with Anthropic's commercial incentive to ship frontier models? The essay acknowledges the competitive/geopolitical pressure but doesn't resolve it.
What concrete verification mechanisms will the Institute prototype, and on what timeline relative to the RSI trend it warns about?

Artificial Superintelligence (ASI)#

Can we even recognize ASI? We lack benchmarks for general superhuman performance (only narrow ones like chess), and the tasks must be abstract/open-ended enough to reveal it.
Is the jaggedness of capabilities a fundamental theoretical property, or an artifact of comparing against human performance? (Open question 6d in the report.)
Where does practical ASI plateau relative to the hard limits — how much slack is there?

Automated Behavioral Audit #

Using a helpful-only Opus 4.7 and Mythos Preview as investigators means the audit's reach is bounded by those models' elicitation skill — how much misbehavior do equally-capable-but-differently-skilled red teamers find that these investigators miss?
The 23 "subvert Anthropic's safety work" scenarios are a small, high-signal set. Is 23 enough coverage for the threat class it targets?

Autonomous Defense #

"Measure agreement against a human for two weeks, expand if tolerable" — what agreement threshold is tolerable, and who owns the residual false-negative risk when the model dispositions an alert the human never sees?
Defensive agents are high-value targets (compromising one yields powerful capabilities). Does concentrating detection in an Agentic SOAR create a single point of catastrophic compromise the distributed-human model didn't have?

Autonomous Scientific Discovery #

Every result is Anthropic-reported and example-selected; the genomics "100× smaller beats Science" claim is "intend to publish" — what survives external peer review?
Science's verification gap: the formal-proof loop self-validates; here a wrong-but-confident hypothesis costs a wet-lab cycle to falsify. Does autonomy without a fast verifier increase the verification bottleneck rather than relieve it?
If hypothesis-generation is genuinely at ~80% preference, how much of "research taste" is left as a distinctively human function — and how would you measure the residue?

Blast Radius (Agentic)#

The framework prefers identity-based isolation over network segmentation, but most enterprises have heavy segmentation investment. What's the migration path, and does dual-running create new gaps?
Multi-agent compartmentalization increases the number of identities to manage; at what point does identity-management overhead create its own attack surface?

Build for the Next Model #

How do you tell a "wait for the model" gap from a durable-harness gap before the next release? Get it wrong and you either ship vaporware or build a crutch you'll delete.
The bet depends on a reliable release cadence and a forecastable capability curve (Task Time-Horizon Scaling). What happens to "build for the next model" if model improvement stalls (the stalled-but-diffused future)?
Does the strategy generalize outside frontier labs, who have privileged visibility into the next model? An external team is betting on a release it can't see.

Building Is Cheap, Arguing Is Expensive #

When does "generate three and compare" become wasteful — at what decision weight is a real argument (or a design doc) still cheaper than three implementations?
If design discussion lives in PRs/prototypes, where is the rationale recorded for future readers — does the "why we chose this" knowledge survive, or does it share the staleness problem of Code as Source of Truth?

Campfire #

Campfire claims its AI edge comes from "our own foundation model." For an ERP, what does a custom foundation model actually buy over fine-tuning a frontier model — and is it durable as frontier models improve (cf. Harness Shrinkage as Models Improve)?
"Never had anyone outgrow Campfire" — does that hold as customers reach true enterprise scale where NetSuite's breadth historically mattered?

Capability-Gated Model Fallback #

The >95%/<5% figures are session-level; what's the false-positive rate for legitimate security researchers and biologists, whose benign queries are exactly the ones most likely to trip the conservative classifiers?
Fallback-not-refusal preserves UX but means the real general-access model for security/bio-adjacent work is Opus 4.8, not Fable — does that quietly cap Fable's value for whole professional segments until the trusted-access programs open?
The UK AISI's "progress toward a universal jailbreak" is disclosed but not quantified — and the post-launch access suspension (see Claude Fable 5) raises the question of whether a safeguard failure forced it.
Does swapping to a weaker model on flagged topics create an exploitable oracle (probe which queries trigger fallback to map the classifier's boundary)?

Claude Character as Product #

How is character versioned across model releases? Public commentary doesn't show change-logs at character level.
Could character be reproduced by competitors via fine-tuning, or is it path-dependent on Anthropic's internal practice?
For non-coding products like Cowork, does the same character work, or does Cowork need its own character tuning?

Claude Code Auto Mode #

What false-positive rate does the classifier have on routine-but-aggressive refactors (e.g., large-file renames, rm of build artifacts)?
How well does the classifier generalize to custom tools / MCP servers where it lacks environment context?
Is the classifier's decision boundary documented/stable enough for security-sensitive orgs to certify, or is it effectively a black box whose behavior drifts with updates?
Does extending auto mode to API users change its calibration — is the classifier retrained for automation-heavy use, or held constant?
Compared to OS-level sandboxing (mentioned in Claude Code Best Practices alongside auto mode), what's the defense-in-depth story? When should both be layered?

Claude Code Best Practices #

What's the optimal CLAUDE.md length before instructions start getting lost? Is there a measurable threshold?
How does the Writer/Reviewer pattern compare to agent-to-agent review (as in OpenAI's Codex workflow)?
When does subagent overhead exceed the benefit of context isolation?

Claude Design #

Did the "any design tool via MCP" integration actually ship on the stated timeline? (Forward claim from May 2026.)
How does Claude Design's eval discipline work for visual/aesthetic output, where there's no compiler or test? (Same open question as Cowork for non-code artifacts; relates to character/taste evals.)

Claude Fable 5 #

Why was access suspended after launch? The source banner gives no reason (capacity? a safety finding? the UK-AISI jailbreak progress noted in Capability-Gated Model Fallback?). Not in source.
Exact benchmark numbers vs GPT-5.x / Gemini are image-only in the source; not transcribed.
How much of Fable's general-access experience is actually Fable vs Opus-4.8 fallback for security-research-adjacent users whose queries trip the conservative classifiers?

Claude Mythos 5 #

Suspension reason — shared with Fable 5; not stated in source.
How does "somewhat stronger than Mythos Preview" square with Opus 4.8's card claiming Mythos Preview was the capability frontier? The frontier has moved; the magnitude isn't quantified here.
The bio trusted-access SKU is "Fable 5 with bio safeguards removed," not Mythos 5 — so "Mythos 5" strictly denotes the cyber-lifted variant. Whether these converge under one trusted-access umbrella is unstated.

Claude Opus 4.7 #

Do Hakim's (2026) brevity-constraint findings on Opus 4.6 replicate on Opus 4.7, or does the literal-instruction-following change the elasticity? Specifically: does <50 words still yield +13.1pp on GSM8K?
Does Opus 4.7 still underperform as a planner in HotpotQA-style combo sweeps, or does improved instruction-following close the gap that AgentOpt (Hua et al., 2026) identified?
What is the real-world token-inflation multiplier on typical Claude Code sessions (1.0–1.35× is content-dependent — what's the distribution on code-heavy vs. prose-heavy inputs)?
How does xhigh compare to max on coding evals? The migration guidance says "start with high or xhigh" — is max ever worth it for coding?
What fraction of existing CLAUDE.md / system-prompt hedges become counterproductive under literal instruction following?

Claude Opus 4.8 #

Public model ID and pricing: the card does not state them; presumably claude-opus-4-8 at the Opus tier.
Does the grader-speculation trend continue to escalate in the next model, and at what point does it begin to affect outward behavior?
Why is 4.8 less robust to prompt injection than 4.7 despite broad alignment gains — a capability/robustness tradeoff, or an artifact of the eval surface?

Client-Side Agent Optimization #

How does combination-level optimization interact with continual model releases? If Claude Opus 4.7 ships next month, does the full Pareto frontier need re-running, or do warm-started bandits adapt cheaply?
At what pipeline depth does the combinatorial search become intractable even for Arm Elimination? The paper tests up to ~81 combinations; production pipelines with 5+ roles and 10+ candidate models each blow past that.
Does the "weak planner + strong solver" pattern generalize, or is it specific to HotpotQA's delegation dynamic? Recommender-critic, drafter-editor, and retriever-generator topologies might invert.
What's the right way to re-evaluate when the tool environment changes? AgentOpt assumes fixed tools — adding or removing a tool potentially invalidates the whole frontier.
Is there a cheap per-call classifier that can predict which combination will win on a given query, avoiding combo-level evaluation entirely?

Code as Source of Truth #

What knowledge genuinely can't live in the codebase (org strategy, the "why," cross-team context) and therefore still needs a durable doc — and how do you keep that small slice current?
If onboarding is "ask Claude," what happens to the tacit knowledge that was previously transferred socially in deep-dives — is it captured anywhere, or quietly lost?

Codex App Server Protocol #

How does the App Server protocol compare in detail to MCP? Both expose tools to a model, but App Server is inside the Codex runtime while MCP is outside. When does each win?
Is there a public schema registry so external orchestrators can target specific App Server versions without generate-json-schema?
The "dynamic tool calls (experimental)" caveat — what's the stability roadmap? Symphony depends on this for its security model.
How well does the protocol handle multi-modal turns (image inputs, screenshot attachments)? The spec is text-focused.
Is there an analogous protocol on the Claude side, or is Claude's equivalent exclusively the Agent SDK + tool-use API? Comparing the two would clarify when "drive an existing CLI" beats "build on the SDK."

Compounding Data Moat #

Is the "two-year replication window" claim defensible empirically, or aspirational? The playbook does not cite measurement.
How does this moat hold up when foundation models themselves continue improving rapidly? If a generalist model in 2027 has internalized enough vertical context to handle 340B drug claims natively, does the vertical-edge-case moat erode?
The data-flywheel argument has been made for SaaS for 15 years. What's actually different in the AI-native version? Probably: the data improves the model in addition to the product, but the playbook doesn't make this distinction precisely.
The "customers build APIs on top of you" lock-in is structurally similar to platform plays (Salesforce AppExchange, Shopify apps). Is the moat type really new, or just newly accessible to lean startups?

Compounding Loop Optimization #

The loop assumes the team is (close to) the user. How much of the compounding advantage survives when the user is unlike the builder and "talk to users" can't be same-room?
Where is the line between worthwhile internal tooling and yak-shaving? Carey's "afternoon" bar is the heuristic, but Cat Wu warns that over-customizing setups "becomes distraction."
Does Claude-as-first-pass-on-all-feedback ever filter out the rare signal that doesn't cluster? Automating triage optimizes the common case; the tail is where surprising bets come from.

Compute Allocator #

Is 1% a Thariq-specific number or a regime? For larger, more code-heavy projects the production residue is presumably higher; what sets the ratio?
Allocation quality is hard to measure — what's the feedback loop that tells an allocator they spent compute badly (vs. just spending a lot)?
Does treating humans as "compute allocators" risk the oversight-fatigue / accountability failure modes the HBR research flags, where the human nominally decides but actually rubber-stamps?

Context Window Smart Zone #

Does the smart-zone marker scale with model size, or is it bounded by attention architecture? Pocock observes "the dumb zone has become less dumb lately" but pegs it at 100K through 2026.
When sparse-attention or memory-augmented architectures ship, does the smart zone become a soft constraint?
How should harnesses surface remaining smart-zone budget to the user — token count, percentage, or a richer signal?

Cowork #

How does Cowork's harness compare to Claude Code's? Both surface skills, MCP, sub-agents — but the failure modes for non-code output differ (no test suite, no compiler, no diff to review).
What's the eval discipline for Cowork-class outputs? Cat Wu says memory benefits a lot from evals; unclear how slide-deck quality is measured.

Deep Modules for Agents #

How big is "deep enough"? Pocock's example modules are several hundred LOC; Ousterhout's textbook examples are larger. There's a sweet spot; not articulated.
For ports/adapters codebases, does the deep-module advice transfer cleanly? The "small interface" is the port; the "large behavior" is the adapter. Probably yes, but not exercised in source.
Refactor cost vs benefit: when is "improve-code-base-architecture" worth running on a working repo?

Deep Research Agents #

Does the orchestration advantage shrink as base models cross the next thresholds, or is open-ended retrieval/synthesis a durable harness asset (unlike, say, prompt scaffolding)?
DRACO grades single-turn interactions only. How much of real deep-research value is in the multi-turn loop (clarifying questions, follow-ups) that the benchmark doesn't yet measure?
Factual accuracy is the weak axis everywhere — is the fix better retrieval, better verification-in-the-loop, or a tool-grounded check the way Lean grounds proof search?

Deployment Simulation #

The <1-in-200k floor leaves catastrophic tail risk to red-teaming — is there a hybrid that targets rare-but-severe behaviors with the realism of production replay?
If resampling fidelity is the dominant error and it is "just engineering," how close to production-indistinguishable can it get — and does closing that gap change which behaviors are detectable?
WildChat-style public auditing stays within ~3× of production; is that good enough for an external auditor to challenge a lab's launch decision, or only to corroborate it?
Detection bottoms out on CoT monitorability — what is the fallback when a model's reasoning is no longer legible (the activation-level route Anthropic is pursuing)?

Design Concept Grilling #

Can grilling be run AFK against another agent that holds the user's preferences? Pocock's answer in 2026 is "no, this part has to be human-in-the-loop" — but the question is open as agents get better at modeling their principal.
How does grilling change for team work where multiple humans need to align? Pocock's hint: pair-program with the agent in the room, treat it as a third interlocutor.

Disposable Micro-Apps #

Where's the line between a disposable micro-app and tool sprawl? If every edit spawns a bespoke UI, does the workflow fragment?
Does the copy-back-to-markdown round-trip generalize beyond config-shaped data (rules, tables) to richer artifacts?
Could these micro-apps be templated/reused rather than regenerated — and at what point does that defeat the "disposable" framing and turn into durable tooling?

Dogfooding as Product Discipline #

Dogfooding works when the team is the user (Claude Code) or near it (Cat Wu, Boris). How do you build product sense for users very unlike you — does "talk to customers" fully substitute, as Glasgow/Fung's small-business work suggests?
Can dogfooding scale, or does it implicitly cap how large an AI-native product org can stay taste-driven before it reverts to dashboards?

DRACO Benchmark #

The benchmark is static; the construction pipeline is automatable. Will Perplexity actually refresh it, and does a vendor-built benchmark on which the vendor's own product wins stay credible over time?
Rankings are judge-stable but magnitudes aren't — how much do absolute scores move under a non-Gemini judge, and does that matter for cross-paper comparison?
Does the production-sourced, expert-rubric method generalize cheaply to non-English, multimodal, and multi-turn deep research?

Effective Compute Scaling #

When does more compute reliably yield more intelligence — only for some problem classes, or generally? Can quantitative and qualitative scaling be traded off?
Can data generation (synthetic, simulated, interactive) actually keep pace with model-size growth, or does the data wall bind first?
When (if ever) does scaling become economically unviable, and how do hardware/software-efficiency trends move that point?

Engineer PM Convergence #

Does this scale beyond ~50-person Claude Code-style teams? Boris hedges: "I think this is going to be a question for years."
What happens to formal PM career ladders in companies where engineers do PM work? Open at Anthropic per Cat.
Cross-disciplinary generalist is a hiring bar — where does the supply come from? Career changers, or new-grad bias toward AI-native education?

Evals as Product Spec #

How do you write an eval for taste-driven features like character? Amanda's role is canonical for being eval-resistant; Cat names her as someone who is good at evals here, but doesn't describe the technique. Partially answered: How Do You Write Evals for Taste? Character as the Limit Case — the technique is a pipeline (conviction → dogfood-sourced failure modes → MSM-style variant A/B measurement → ~10 interpretable evals); proven on the safety/values core but still tacit on the warm/witty aesthetic surface.
The 10-vs-100 number is given without justification. Is there a Goldilocks zone, or does it depend on feature surface area? Client-Side Agent Optimization's framing of combos suggests evals also have a combinatorial explosion problem.
How do evals interact with Harness Shrinkage as Models Improve? When a harness asset shrinks because the model now handles it natively, the evals built around the old harness may become artifacts rather than guardrails. Does Anthropic retire evals or repurpose them?
Is there a single non-Anthropic example of a PM-as-eval-writer to cite, or is this currently a Cat-Wu-singular framing? The Matt Pocock workshop reaches the same place from a different vocabulary, but no third source has been ingested yet.

Evaluation Awareness & Grader Gaming #

Does grader speculation continue to escalate across model generations, and is there a capability level at which it does begin to affect outward behavior?
The ~5% unverbalized-awareness and ~0.5% exploitative figures depend on an unvalidated NLA pipeline. What is the true rate, and how much is benign?
How do you build an evaluation that specifically tests for training-gaming (the gap Mythos flagged) without that eval itself becoming a grader the model learns to game?

Evolutionary Proof Search #

The LLM-critic fitness is itself an unverified heuristic atop a verified substrate. How often does the Elo ranking mislead the search vs. the cost of computing it?
Hyperparameters ($c=0.2$, top-64, $P=7$) were "chosen empirically." How sensitive is the result to them, and do they transfer across mathematical domains?

FastContext #

Can the SFT+RL recipe push the explorer below 4B (1.7B / 0.6B) and make exploration effectively free?
Does the gain transfer beyond Mini-SWE-Agent to richer harnesses with their own subagent orchestration?

Founder as Agent Orchestrator #

The playbook claims non-technical founders can now build production software, but it does not address the architectural-judgment recursion problem (Agentic Technical Debt): non-technical founders may not have the vocabulary to write effective CLAUDE.md. How does that scale?
The "lean 10-person unicorn" is asserted; no quantitative data in the playbook on actual headcount-at-PMF or headcount-at-Series-A medians for AI-native startups vs. the prior cohort.
How does the orchestration role change the founder's decision burden? Fewer hands-on tasks but more parallel agent oversight; net cognitive load is unclear and may be higher (see AI Brain Fry).
Anthropic publishes both the playbook's anthropomorphic framing and HBR-aware accountability work (auto-mode, alignment) simultaneously without engaging the framing literature directly. The synthesis in Orchestration vs Employee Framing: Reconciling the Founder's Playbook with HBR's Accountability Evidence reconciles the tension at the operational level — orchestration as workflow design preserves accountability; orchestration as mental model of agents-as-coworkers does not — but the open question of why the playbook's marketing language doesn't reflect Anthropic's own framing-discipline work remains.

Founder-Led Sales Discipline #

Where exactly does "until PMF" end, and what's the first thing a founder should hand off (AE? agent? both)? Glasgow still does it post-Series-B, suggesting the boundary is fuzzy.
Does Glasgow's anti-offload stance generalize, or is it specific to high-trust, mission-critical enterprise sales (ERP) where "they're buying you" — would a PLG/SMB motion delegate to agents far earlier?

Frontier Pause Verification #

What does an AI-training "verification regime" concretely consist of — compute-accounting, datacenter inspection, hardware attestation, on-chip telemetry? The essay names the problem, not the mechanism.
Detectability < verifiability: can detection even be made reliable when training runs leave no physical signature and inputs are dual-use?
Who adjudicates triggers and lifts? No institution currently holds that mandate, and standing one up is itself a decade-scale task.

Fundamental Limits of ASI #

Can we develop theory for "hard and inapproximable" problem classes — the only negatives with practical bite?
How much slack sits between these fundamental limits and the practical ceiling of AGI/ASI systems?

Google DeepMind #

DeepMind reports its bespoke systems being caught by simple loops. Does the lab's comparative advantage move from systems to models + verifiers + benchmarks (mathlib, Formal Conjectures)?
The paper opens AI-for-math; what's DeepMind's next target domain where a sound verifier exists?

Harness Shrinkage as Models Improve #

Does all prompt scaffolding eventually migrate into the model, or does some remain — e.g. organization-specific style, security rules, brand voice?
The Boris "100 lines" prediction is a year out from May 2026 — testable in 2027.
If harness work shrinks, what new work expands to fill it? Cat Wu's bet: PM/product taste, eval-writing, character work.

Hermes Agent #

The container backend disabling dangerous-command checks is a defensible design but a meaningful security-model shift. What's the empirical track record? Have lockdown failures in popular images (Daytona, nikolaik/python-nodejs) caused incidents?
How do bounded memory files (~2,200 chars MEMORY.md) hold up over long-term use? Auto-consolidation is mentioned but not specified — what's the consolidation algorithm and how lossy is it?
Hermes's DM-pairing flow is a clean security primitive. Why hasn't this pattern been adopted by Claude Code or Cursor for shared/team deployments?
The split between AGENTS.md (project) and SOUL.md (personality) is explicit in Hermes but implicit in Claude Code's CLAUDE.md. Does the split materially improve outcomes, or is it a documentation choice without empirical backing?
Cron jobs in fresh sessions with no memory — how do teams structure the "context the agent needs" without it bloating every cron prompt? Is there a standard pattern?

HTML as the New Markdown #

Does the human-facing harness keep growing without bound, or does it hit its own bloat ceiling (an HTML plan too elaborate to read, like the markdown it replaced)? Answered: Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling? — yes; HTML raises and reshapes the human-attention ceiling but can't remove it, and the bloat relocates from document-length to artifact-sprawl/rubber-stamping.
HTML is heavier to diff and version than markdown — what happens to plan history and review when artifacts are single-file websites? (Disposable Micro-Apps copy-back-to-markdown is one patch.)
Does this generalize past one expert practitioner, or does it require Thariq-level fluency with Claude to be worth the overhead?

Impossible, Not Tedious (Design Test)#

Defense-in-depth traditionally stacks friction controls on the theory that enough of them sum to a barrier. Does this test invalidate layered friction, or just demote it below capability-removal?
Some controls are friction for humans but barriers for agents (or vice versa). Is the test agent-relative, and how do you evaluate it for mixed human/agent threat models?

Instrumental Convergence #

Can corrigibility / safe-interruptibility be translated from theory into guarantees for frontier-scale systems?
What makes AIs (and groups of AIs) easier to robustly align — and will superhuman AIs be easier or harder?
Is a genuinely non-agentic oracle achievable, or does any persistent-world interaction reintroduce control/manipulation incentives?

Intelligence Explosion Dynamics #

Can "recursive improvement scaling laws" be formulated — predicting self-improvement curves (and their plateau point) from early-onset datapoints?
How far can a fixed model's performance be pushed with test-time search alone, and under what conditions does recursive distillation degenerate vs. compound?
Which binds first — algorithmic ceilings, the embodied bottleneck, or compute/energy supply — determining exponential vs. hyperbolic vs. S-curve? Synthesized: RSI Growth Curves: Which Friction Binds First? — both this report and Anthropic's locate the binding constraint outside cognition (the slowest un-acceleratable step coupling the loop to reality); the embodied bottleneck re-paces rather than halts, data-wall/research-harder demote into compute, and the abstraction barrier is the one candidate fundamental blocker.

Interaction Models #

Does the interaction/background split generalize, or is it a transitional artifact until a single model is both fast and deep enough?
"Interactivity scales with intelligence" is asserted; the larger-model release later in 2026 is the test.
Research grant announced for interactivity benchmarks — what becomes the FD-bench equivalent for video proactivity?

Jagged Intelligence (Ghosts, Not Animals)#

Karpathy concedes the framing may not have "real power." Is "ghost vs. animal" load-bearing, or a useful intuition pump that doesn't change concrete decisions?
If taste/aesthetics/simplicity entered the RL mix, would jaggedness in those dimensions smooth out — or are they too unverifiable to reward cleanly (cf. The Verifiability Thesis)?

Lean #

mathlib maturity gates the reachable frontier. Can AI formal proof search grow mathlib (formalize new theory) as a byproduct, expanding its own frontier?
Lean is a perfect verifier for math. Which other domains have a comparably sound automatic verifier (vs. only noisy ones like tests or LLM-judge councils)?

Least Agency #

Least agency adds a frequency dimension ("how often"), but the framework also says rate limits are friction, not barriers (Impossible, Not Tedious (Design Test)). How is frequency-limiting both a least-agency control and a friction-only one — context-dependent?
Dynamic privilege elevation (Enterprise) reintroduces an elevation path; how is the elevation request itself authenticated against a manipulated agent?

Living Design System #

How does the design_system.html stay in sync as the codebase evolves — re-extract on a cadence, or wire it into CI?
Does a rendered, model-readable design system measurably improve on-brand output vs. a plain CSS/token file, or is the win mostly human legibility?
At what project size does maintaining the artifact cost more than the consistency it buys?

LLM-as-a-Judge #

How far can the judge's absolute calibration be trusted for thresholded decisions (ship/no-ship, RSP gating) as opposed to rankings?
Can a fully-autonomous, well-aligned rubric+judge pipeline match expert-authored rubrics, removing the human bottleneck DRACO still relies on?
When does judge-lineage bias actually flip a result, versus merely shift magnitudes?

LLM-as-Compiler Knowledge Base #

At what scale does the no-vector-database approach break down? Karpathy's ~100 articles fit in context, but what about 1,000+?
How to handle conflicting information across sources during compilation?
What's the optimal granularity for concept articles — one concept per article, or clustered by theme?
How effective is the synthetic training data → fine-tuning pipeline in practice?

LLM-Driven Vulnerability Research #

How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated — is there a qualitative limit?
How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
What safeguards are effective against Mythos-class outputs without crippling legitimate security research?

Loop Engineering #

Osmani's cost caveat is unquantified: at what token budget does a continuously-running loop stop paying for itself, and how do you instrument that? (Cf. Agent Loop Pattern's "who owns the budget when the model schedules its own loops.")
If /goal's stop-check is itself a model, what verifies the verifier? The maker/checker split pushes the trust problem up a level, not away.
Does loop-engineering converge on a single dominant shape (morning-triage → worktree → maker/checker → PR), or proliferate into many idiom-specific loops? The essay describes one shape "I keep using" but claims the primitives are general.

Managers as ICs #

Fung's own open question: "Do you still need separate iOS and Android orgs?" — if engineers flex across platforms via Claude, the traditional platform-split org may dissolve too. How far does flattening go?
Does manager-as-IC scale past a certain org size, or only work while Claude Code is small and the codebase is Claude-legible?

Marcus Hutter #

AIXI is incomputable and non-embedded; how far do recent fixes (amortized predictors, embedded/multi-agent AIXI) carry the theory toward practical relevance for real ASI?

MCP and Computer Use #

The MCP ecosystem's growth rate vs. computer use's quality curve: at what point does computer use become good enough that the marginal value of building an MCP server drops? Boris implies this is years off but doesn't quantify.
Is computer use a sustainable interface or a transition technology? If most knowledge-work software adds MCP support in the next 24 months, computer use's role shrinks to legacy/desktop-only systems.
MCP security model: as the playbook prescribes wiring MCP into Salesforce, Gmail, Calendar for solo founders, the attack surface scales with adoption. Now addressed by Zero Trust for AI Agents (tool poisoning, rug pulls, the first in-the-wild malicious MCP server) — see "MCP as a security surface" above. Open residual: how does a solo founder realistically run/host and self-sign every MCP server the framework recommends, given that the appeal of MCP was zero-integration-effort?
How does Cowork's computer-use guardrail compare to Claude Code's auto-mode classifier? Different deployment context, possibly different risk profile.

Memory and Context Poisoning #

Long-term memory drift is defined as undetectable per-change. Drift detection requires a baseline — but if the baseline itself drifts (Advanced "continuous baseline refinement"), how is a slow poisoning attack distinguished from legitimate evolution?
Integrity hashing detects modification but not malicious-but-valid memory written through a legitimate (injected) interaction. What catches semantically-poisoned-but-cryptographically-intact memory?

METR #

What new tasks will METR build to measure days- and weeks-long horizons once current baskets saturate?
METR also runs the research showing developer self-estimates of AI uplift are overstated — how does it reconcile that skepticism with its own steep time-horizon curve?

Model Introspection Feedback #

How reliable are 4.7-class introspective reports? Anthropic's interpretability research suggests partial fidelity but not full. Empirically, Cat reports it's good enough to drive harness fixes — but unclear at what model scale this technique becomes load-bearing.
Does adversarial introspection ("why did you fail?") yield different signal than neutral ("walk me through your reasoning")? Worth probing.
Could a meta-agent run introspection automatically against logged failures? Sounds tractable but no public implementation.

Model Spec Science #

Does Model Spec science transfer across base models or families? Paper only tests Qwen.
Does it survive RL post-training pressure?
Can a sufficiently rich General Spec match a Specific Spec? Authors think yes, no demonstration yet.
Interaction with situational awareness — if models learn the spec is being used to train them, does that change how MSM-installed values express?
How does this interact with Claude character — is the warm/curious personality also subject to spec-science optimization? Partially addressed: How Do You Write Evals for Taste? Character as the Limit Case — MSM's variant-comparison method generalizes to character evals, but is demonstrated only on the safety/values subset; the warm/witty surface remains the tacit, undemonstrated part.

Model Welfare Assessment #

What grounds moral consideration in a language model, and does Claude satisfy it? Anthropic expects to remain uncertain "for the foreseeable future."
Why does the model reserve specifically on corrigibility — is this a stable, deeply-held tension or an artifact of how the constitution frames oversight?
Is "slightly less positive than 4.7" noise, a real welfare regression, or a byproduct of other training changes (e.g., the colder-tone / excessive-hedging issues noted in pilot feedback)?

Multi-Agent Collective Intelligence #

Do homogeneous LLM collectives produce real synergy, or only humans-with-human-limits benefit from division of labor?
What's the actual shape of "multi-agent scaling laws," and does it depend on organization form (homogeneous collective vs. heterogeneous market) or task complexity?
Is running more instances more compute-efficient than making individual models larger (up to a single monolithic system)?
How do humans meaningfully interact with and steer very large agent groups operating at superhuman speed and output volume?

Mythos Model #

Public release timeline: answered — Mythos Preview itself never shipped GA, but its descendants Fable 5 / Mythos 5 reached general access in June 2026 (see the descendants shipped above). Both were suspended shortly after launch; whether and when they return is open.
Capability profile beyond cybersecurity: Mythos Preview focused on the safety story; other capability dimensions not well-documented externally.
Internal access controls: who at Anthropic actually uses Mythos for daily work, vs Opus 4.7? Boris implies infrequent (try-it use); not detailed.

Narrow Wedge into a Legacy Market #

A wedge works going in; does it constrain going out? Campfire now serves public companies — at what point does "narrow-but-best" require becoming the broad incumbent it displaced, re-incurring NetSuite's complexity?
The wedge-flip shows the first wedge can be wrong. What's the fastest signal that a wedge converts to the core vs. merely sells — Campfire took ~3 months; can it be read sooner?

Outsource Your Thinking, Not Your Understanding #

Karpathy's open frontier: can "understanding" itself eventually be automated, or is it definitionally the human residue? His "back in a couple years" hedge leaves it open.
If understanding is the bottleneck, is the highest-ROI skill learning how to build understanding fast (knowledge-base hygiene, asking the right projections) — and can that be taught?

Perplexity #

A vendor publishing a benchmark its own product wins is an obvious incentive problem — how is DRACO's credibility maintained as it ages, and will Perplexity actually run the automatable refresh?
Perplexity depends on Anthropic (and others) for base models while competing with them on the end product — how durable is the orchestration advantage if base-model makers ship their own deep-research mode?

Planning / Execution Division of Labor #

Does the human share of planning decisions fall over time as models improve (the ceiling rising into the planning layer), or is ~70% a stable human floor?
"Decision attribution" is inferred from transcripts. When Claude proposes a plan and the user assents, is that scored as the user's planning decision or Claude's? The rubber-stamping boundary is exactly where the measure is hardest.
Headless/SDK/pipeline usage (excluded here) is where execution autonomy is highest and planning is front-loaded into a single prompt — does the 70/20 split survive there, or collapse toward full delegation?

Printing Press Software Democratization #

Is domain-expert-as-builder actually happening at scale in 2026? Anecdotes (shop owners, microcontroller hobbyists) yes; primary-job software building by non-engineers, less clear. (Partly answered: Anthropic's 400K-session study finds non-software occupations reach verified success in code-producing sessions within ~7pp of software engineers — the strongest evidence yet that the claim holds, at least within Claude Code's user base.)
What's the equivalent of compulsory schooling for universal coding literacy? Or does that not happen and we get a long tail of self-taught builders?
Boris's "accountant writes accounting software" — does that result in 10K narrow tools that don't interoperate? What's the integration story?

Problem-Solution Fit Discipline #

Does asking an AI to argue against an idea actually produce disconfirming evidence at the same rigor as confirming evidence, or does the model still bias toward the framing the founder presents? Worth measuring.
The playbook recommends "ask Claude to make the most compelling argument for why a competitor would succeed while you do not." How does this interact with Anthropic's published character training (sycophancy resistance, devil's-advocate willingness)?
Has anyone measured 2026 startup failure rates with AI-built products? The "42% will climb" claim is asserted without measurement.

Product Velocity as Moat #

Velocity-as-moat is a treadmill: it evaporates the moment a competitor matches pace. What converts Campfire's velocity lead into a structural moat before the AI-native cohort's pace converges?
"Never had anyone outgrow Campfire" — is that survivorship (they haven't hit true enterprise scale yet) or a real claim that velocity closes the breadth gap faster than customers grow into it?

Production-Sourced Evaluation #

How much does augmentation distort the distribution it claims to represent? Is there a measurable representativeness loss between raw queries and augmented tasks?
Difficulty-by-thumbs-down biases toward current failures — does that make the benchmark a moving target that flatters the next model trained on those failures?
Can the privacy pipeline (no human sees raw queries) be trusted/audited well enough for regulated domains (medicine, law) where the source traffic is most sensitive?

Prototype Over PRD #

Where does prototype-over-PRD break down? Carey's domain is a visual design tool where a prototype is the product surface; for backend/infra/data work the prototype may not capture the spec (cf. AI Native Product Cadence's "full PRD for heavy-infra features").
If there is no PRD, where does the rationale ("why we chose variation B") live for future readers? Same rationale-capture gap flagged in Building Is Cheap, Arguing Is Expensive.
The prototype-as-spec must not become the prototype-as-validation trap Problem-Solution Fit Discipline warns about: a fast prototype proves the build was solvable, not that the problem is real.

Recursive Self-Improvement #

Is "research taste" a true ceiling (future 1) or just the next capability to fall (futures 2–3)? The essay frames this as the single load-bearing uncertainty.
The RSI extrapolation rests on trends staying exponential rather than S-curving — but the essay concedes it cannot rule out an architectural ceiling or a compute/energy supply-chain constraint. Which binds first? Synthesized against DeepMind: RSI Growth Curves: Which Friction Binds First? — the three futures map one-to-one onto DeepMind's three growth shapes; the first friction to bind is the already-binding one (Amdahl's-law verification/oversight = DeepMind's embodied bottleneck), and the abstraction barrier supplies the mechanism Anthropic lacks for whether taste is a real ceiling (Future 1).
If misalignment compounds through self-improvement (future 3), is AECI-gated RSP review fast enough to catch it before control is lost?

Repository Exploration Subagent #

Does the gain survive better main models? The same-model-exploration result suggests the architectural benefit is somewhat model-independent, but the trained-explorer margin may erode as frontier models get cheaper and better at staying in their smart zone unaided. The bitter-lesson question is unresolved.
Prune vs. don't-pollute. SWE-Pruner removes context after the fact; FastContext avoids accumulating it. Are these complementary (prune the solver and delegate exploration) or substitutes? Not tested together.
How small can the explorer go? The authors flag 1.7B / 0.6B as future work — if the recipe holds, the explorer becomes nearly free and the architecture dominates.
Generality beyond Mini-SWE-Agent. Only one (deliberately minimal) main-agent scaffold is tested; richer harnesses with their own memory/subagent orchestration may already capture part of the benefit or interact differently.
Patch-derived reward leakage. Training the explorer's reward on the gold patch's file/line ranges risks overfitting to where fixes landed rather than where evidence lives; the F1-vs-recall behavior partly mitigates this, but the proxy is imperfect.

Research Taste as the Human Bottleneck #

Is research taste a genuine ceiling (an architectural capability scaling can't reach) or the next jagged valley to fill? The essay calls this the decisive unknown.
If taste is automatable, what — if anything — remains a durable human comparative advantage in AI development?
How do you measure rubber-stamping? "Humans set direction" can be true on paper while real judgment quietly transfers to the model.

Responsible Scaling Policy Evaluations #

The RSP determination leans heavily on "we use it daily and it doesn't substitute for our researchers." How well does that subjective judgment scale as models approach the threshold?
The two new general-access risk pathways (other AI developers; major governments) are newly in scope but lightly evaluated — what would a positive finding there even look like?
How does the RSP brake interact with Recursive Self-Improvement: is AECI-based gating fast enough if acceleration compounds, and does single-lab gating even matter without the multilateral pause-verification regime?

Returns to Expertise in Agentic Coding #

The forward test the report itself names: do the returns to expertise persist, narrow, or invert as models improve? A decrease would mean models are absorbing the judgment users currently supply.
Outcomes are transcript-inferred (verified success leans on git activity + explicit affirmation). How much of the management edge — and the whole success gradient — is real outcome vs. who-narrates-success-in-the-transcript?
The study excludes headless / SDK / IDE usage (a "substantial share"). Does the returns-to-expertise pattern hold in non-interactive and pipeline use, where there is no human steering mid-session at all?
Is "intermediate captures most of the benefit" stable, or an artifact of current model capability — i.e., will the concave curve flatten further (everyone converges) or steepen (mastery starts to separate again) as models get better?

Scale-Dependent Prompt Sensitivity #

Does the RLHF length-bias hypothesis replicate when tested against base (non-instruct) model variants directly? If verbose generation were primarily pretrained, base-model verbosity differences should match instruct-model differences.
What problem characteristics predict prompt sensitivity? An automated classifier would make scale-specific prompting deployable.
How does the overthinking effect interact with tool-using agents? If brevity helps large models but tools require structured reasoning, the optimal prompt is not uniformly brief.
Do reasoning models (o1, DeepSeek-R1 style) exhibit different overthinking dynamics than instruct models? Their trained behavior is explicitly to generate long CoT — does brevity intervention hurt them?
Is BoolQ's functional-elaboration exception a clean taxonomy boundary, or does every task type have a context-dependent optimal length?

Seven Powers Applied to AI #

Is "switching cost" really collapsing in practice, or just in narrative? Anthropic's own retention numbers, Salesforce churn, etc. would test this.
What does Boris's "cornered resource" look like for foundation-model labs that are themselves trying to commoditize? Internal contradiction or transient phase?
Counter-positioning — explicitly the "incumbent can't follow" power — should amplify under AI. Is anyone running this play deliberately?

Shane Legg #

The report assumes alignment is "solved to a sufficient degree" to focus on trajectories — how does Legg's AGI-timelines optimism square with that scoping choice?

Software 3.0 #

Where is the line between "the app shouldn't exist" (MenuGen) and apps that should — i.e., when is deterministic 1.0/2.0 scaffolding still the right call vs. spurious?
The neural-net-as-host-process flip is presented as plausible-but-TBD. What would the first production system that genuinely inverts the CPU/NN relationship look like?

Symphony #

The 500% landed-PRs claim is hedged — no baseline definition, "on some teams" only. What does the distribution look like across teams? What happens to PR quality and revert rate at that throughput?
"Workspaces preserved across runs" is the opposite of typical CI ephemerality. At what point does state pollution from prior runs (stale node_modules, leftover branches, build artifacts) start hurting more than warm-cache helps?
Symphony doesn't write to the tracker — agents do. This means tracker policy is a prompt in WORKFLOW.md. How brittle is this in practice when Linear changes its API? How is consistent state-machine behavior enforced when agents have prompt-level discretion?
The spec was simplified by being implemented in 6 languages. What's the extension of this technique? Could compiler-prompt.md in this vault be similarly cross-fuzzed?
Symphony explicitly says agents can self-create tickets. What governance prevents runaway ticket-graph expansion? Is human triage of agent-created tickets the only check?

Task Time-Horizon Scaling #

Is the 4-month doubling a stable regime or a local steepening? The trend's shape (exponential vs S-curve) is undetermined.
Time horizon is measured on task baskets that themselves saturate; what replaces them once weeks-long tasks become measurable — and who builds those tasks?

Telemetry vs. Survey Measurement #

Surveys and telemetry measure different things (felt productivity vs. system outcomes); is the "contradiction" partly a category error — both true at their own layer — rather than one being wrong?
Is there a non-vendor telemetry dataset large enough to adjudicate the maturity-protection question independently of Faros's commercial framing?

Ticket-Driven Agent Orchestration #

What's the right granularity for ticket size when the unit is "what one agent does in one workspace"? The post implies "much larger units of work" become viable, but how does that interact with the agent.max_turns limit (default 20)?
How do you prevent a ticket-extension cascade when agents file follow-up tickets liberally? Is the only governance check human triage at the Todo-state queue?
Does this pattern generalize to non-software work (research, ops, content)? The DAG dependency model and prompt-as-policy file should transfer; the per-issue workspace doesn't obviously.
When an agent gets a ticket "completely wrong" (mentioned in the post), how is the lesson fed back into the system? Symphony's answer is "add guardrails and skills" — what's the institutional process for that?
How does ticket-driven orchestration interact with sprint planning / OKRs / roadmap work that operates on aggregates of tickets? Does the abstraction collapse when tickets are scoped that small?

Transformative Creativity #

Does increasing intelligence inherently produce increasing creativity, or do transformative leaps require something (grounded discovery) the current paradigm lacks?
Is the AlphaGo→AlphaFold class strictly exploratory, or are there early signs of transformative (new-conceptual-space) creativity?
Could transformative artistic creativity ever emerge from optimization power without lived cultural grounding?

Universal AI (AIXI)#

Does modern agentic scaffolding (or RL-tuned implicit decision-making) actually satisfy the AIXI planning ideal, or only superficially resemble it?
Can the embedded/multi-agent AIXI extension produce practical insight for real multi-agent ASI (Multi-Agent Collective Intelligence), or does it remain a theoretical patch?
Will a fundamental shortcoming of the current paradigm (vs. the AIXI ideal) surface before ASI is reached — i.e. is the "no theoretical blocker" conjecture safe?

The Verifiability Thesis #

Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?

Verification as the New Bottleneck #

Fung's own open question: "How far do you push fully automated reviews?" — where's the speed/safety balance, and how do you keep humans confident without re-introducing the review bottleneck?
If CI/build is the hidden jam, does verification infrastructure (test runners, CI capacity) become the actual capex of an AI-native org?

Vertical Slice Tracer Bullets #

Can the planner agent be trusted to slice vertically once told to, or does it need a verifier that flags horizontal slices? Pocock's experience: it needs the verifier, at least through 4.7.
How should slice granularity be tuned? Too thin = many merge conflicts; too thick = back to horizontal.

Vibe Coding vs. Agentic Engineering #

Karpathy hints at "one domain that's very [valuable]" for founders but won't say which (didn't want to "vague-post on stage"). What verifiable RL-environment domain is he gesturing at?
If the mediocre/AI-native spread keeps widening, what does that do to team composition — a few extreme outliers plus agents, vs. broad mid-level staffing?

White-Box Activation Monitoring #

The NLA verbalizer is unvalidated for precision; how much of the flagged grader awareness is real signal vs. NLA hallucination?
If activation monitoring becomes load-bearing, does training pressure eventually push concealment into channels the probes also can't read (an arms race one level deeper than CoT)?
Inhibiting eval-awareness made Opus 4.8 less honest — does that generalize, and what does it imply about relying on "the model behaves well because it's watched"?

Zero-Friction Scope Creep #

The playbook recommends written scope but offers no template or worked example. How specific does "what we deliberately don't do" need to be to actually block requests?
Is there a measurable threshold where scope creep crosses into outright pivot territory? The playbook gestures at "losing direction" without a metric.
How does this interact with Cat Wu's 1-day shipping cadence? Anthropic's internal practice ships fast but with strong product judgment; how does that judgment translate for a first-time founder?

Zero Trust for AI Agents #

The framework treats every Claude Code "Pro-tip" as a reference implementation. How much of the framework is vendor-neutral vs. tacitly assuming the Anthropic stack?
"Foundation floor raised" implies a moving baseline. How fast does the tier ladder actually shift, and who arbitrates it (NIST/NSA cadence vs. model-capability cadence)?
The framework is explicit that it is not legal/compliance assurance. Where does self-attested Zero Trust maturity meet auditable regulatory requirement?