Howardism · vol. 03 · quiet corner of the web

Open Questions Backlog

Published: May 25, 2026 · Filed: Essay · Topic: Architecture · Reading: 29 min · Source: AI-synthesised

_62 pages with open questions, as of 2026-05-25._


Generated by _system/lint.py --write-backlog. Do not hand-edit. Harvested from the ## Open Questions section of every concept article. Work these off via /query; answered items get filed into derived.
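
As a rough illustration of the harvest step described above, the sketch below pulls the bullets under a `## Open Questions` heading out of one article's markdown. The function name `harvest_open_questions` and the assumption that questions are written as `-` or `*` bullets are hypothetical; the actual `_system/lint.py` is not shown in this piece and may work differently.

```python
import re

def harvest_open_questions(markdown_text: str) -> list[str]:
    """Return bullet items found under a '## Open Questions' heading."""
    questions = []
    in_section = False
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # Enter the section on its own heading; leave it on any other H2.
            in_section = line[3:].strip().lower() == "open questions"
            continue
        if in_section:
            match = re.match(r"^\s*[-*]\s+(.*\S)\s*$", line)
            if match:
                questions.append(match.group(1))
    return questions

# Hypothetical article text, used only to exercise the function.
sample = """# Lean

## Summary
A proof assistant.

## Open Questions
- Can proof search grow mathlib?
- Which other domains have a sound verifier?

## Sources
- paper.pdf
"""
print(harvest_open_questions(sample))
```

Bullets under other H2 headings (here `## Sources`) are ignored, which is what keeps the backlog limited to open questions.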


Agent Harness Engineering

  • Does a single general-purpose coding agent outperform a multi-agent architecture with specialized testing, QA, and cleanup agents?
  • How does architectural coherence evolve over years in a fully agent-generated system?
  • At what codebase scale does the AGENTS.md-as-table-of-contents approach need to be replaced with more sophisticated context routing?
  • How generalizable are these web-app-focused findings to other domains (scientific research, financial modeling)?

Agent Loop Pattern

  • When the model schedules its own loops (Opus 4.7 behavior), who owns the budget? Boris answered "the model just decides," but that pushes cost discipline into the model's training, not the harness.
  • Does a loop with a smart enough model still need a Kanban backlog, or does the model choose its own next task from raw goals?
  • Loop output review is now Matt Pocock's confessed bottleneck — "we just need to be ready to be doing more code review."

Agent-Native Infrastructure

  • Who builds the agent-native rewrite of the long tail of human-facing services — the service owners, or a translation layer (MCP servers, computer-use agents) on top?
  • Agent-to-agent negotiation needs trust, identity, and accountability primitives that don't exist yet. What's the protocol layer, and who governs it?

Agentic Loops Overtake Bespoke Systems

  • The bespoke advantage is dated "for now." What's the next model generation's verdict — does the evolutionary/AlphaProof apparatus survive on any problems, or fully collapse to a cost line?
  • Does the "simple loop + verifier beats bespoke system" result hold only where the verifier is perfect (Lean), or also in noisy-verifier domains (tests, LLM-judge councils)?
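
For readers who want the shape of the "simple loop + verifier" pattern the second question refers to, here is a minimal sketch. `propose` and `verify` are hypothetical stand-ins (a model call and a checker such as Lean or a test suite); the toy task and the default budget of 50 attempts are assumptions for illustration, not taken from any cited system.

```python
from typing import Callable, Optional

def loop_with_verifier(
    propose: Callable[[int, list[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_attempts: int = 50,
) -> Optional[str]:
    """Generate candidates until one passes the verifier or the budget runs out."""
    feedback: list[str] = []
    for attempt in range(max_attempts):
        candidate = propose(attempt, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate
        feedback.append(message)  # verifier output is available to the next proposal
    return None

# Toy stand-ins: "find a non-negative integer whose square is 49".
def toy_propose(attempt: int, feedback: list[str]) -> str:
    return str(attempt)

def toy_verify(candidate: str) -> tuple[bool, str]:
    value = int(candidate)
    if value * value == 49:
        return True, "ok"
    return False, f"{value}^2 != 49"

print(loop_with_verifier(toy_propose, toy_verify))
```

With a sound verifier, anything the loop returns is correct by construction. With a noisy verifier, a false positive ends the loop on a wrong candidate, which is the gap the question asks about.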

Agentic Technical Debt

  • How long does a CLAUDE.md remain accurate as a codebase evolves? The playbook gestures at session-by-session updates; no data on rot rate.
  • The remedy assumes the founder is able to articulate architecture in plain language. Non-technical founders (the playbook's headline beneficiary group) may have neither the vocabulary nor the intuition to do this well — a recursion failure the playbook doesn't address.
  • Anthropic's harness-shrinkage thesis suggests CLAUDE.md may eventually be inferred by the model itself. Until then, the discipline is load-bearing.
  • Successes cluster where Lean's mathlib is mature and problems decompose into tractable subgoals (combinatorics, convex optimization, number theory). What expands the frontier to problems needing new theory?
  • The agents inherit their LLMs' biases and show high search variance. How do you characterize and push the boundary of what's reachable?
  • The Graffiti result hints at closing the loop between AI conjecturing and AI proving. What does an end-to-end conjecture→formalize→prove pipeline look like?

AI Native Product Cadence

  • Does the cadence scale beyond ~100 people? Anthropic itself is bigger (~30-40 PMs alone), but the Claude Code team that visibly drives cadence is small.
  • What's the equivalent of research-preview branding for B2B enterprise launches where customers expect stability? Cat Wu doesn't address this.
  • How much of the cadence is structural (process choices) vs cultural (talent density)? Probably both, ratio unclear.

The AI-Native Safe-Choice Inversion

  • The inversion is a one-time repricing of "safe." Once several AI-native ERPs exist, does "safe" re-stabilize around the largest AI-native vendor — and does Campfire's "we're now the largest of the new cohort" claim reflect a land-grab for that position?
  • How long until incumbents bolt on credible AI and neutralize the counter-positioning — and does the custom-foundation-model claim actually defend against that?

AI-Native Startup Lifecycle

  • The playbook gives no quantitative evidence for the headcount/capital compression claims (no median time-to-PMF, no headcount-at-PMF numbers, no failure-rate data). The "lean 10-person unicorn" is asserted as a deliberate target without case-study evidence in the doc itself.
  • Founder stories in the resources section (Carta Healthcare, Anything, Cogent, Airtree, Duvo, Zingage, Kindora, Wordsmith) are short callouts — none have published outcomes or comparable-baseline data.
  • The 42% "built-something-nobody-wanted" CB Insights figure is from a pre-AI era; the playbook predicts the rate will climb but doesn't cite a 2026 measurement.
  • Tension with HBR's accountability findings (above) is unresolved. The playbook's orchestration framing reads as the exact framing HBR's experimental conditions tested against.

AlphaProof Nexus

  • The framework's reach is gated by Lean's mathlib maturity. What's the path to domains needing new theory rather than subgoal decomposition?
  • AlphaProof adds little as a soloist but helps as a tool. As the prover LLM strengthens, does the AlphaProof tool become redundant entirely?

Building Is Cheap, Arguing Is Expensive

  • When does "generate three and compare" become wasteful — at what decision weight is a real argument (or a design doc) still cheaper than three implementations?
  • If design discussion lives in PRs/prototypes, where is the rationale recorded for future readers — does the "why we chose this" knowledge survive, or does it share the staleness problem of Code as Source of Truth?

Campfire

  • Campfire claims its AI edge comes from "our own foundation model." For an ERP, what does a custom foundation model actually buy over fine-tuning a frontier model — and is it durable as frontier models improve (cf. Harness Shrinkage as Models Improve)?
  • "Never had anyone outgrow Campfire" — does that hold as customers reach true enterprise scale where NetSuite's breadth historically mattered?

Claude Character as Product

  • How is character versioned across model releases? Public commentary doesn't show changelogs at the character level.
  • Could character be reproduced by competitors via fine-tuning, or is it path-dependent on Anthropic's internal practice?
  • For non-coding products like Cowork, does the same character work, or does Cowork need its own character tuning?

Claude Code Auto Mode

  • What false-positive rate does the classifier have on routine-but-aggressive refactors (e.g., large-file renames, rm of build artifacts)?
  • How well does the classifier generalize to custom tools / MCP servers where it lacks environment context?
  • Is the classifier's decision boundary documented/stable enough for security-sensitive orgs to certify, or is it effectively a black box whose behavior drifts with updates?
  • Does extending auto mode to API users change its calibration — is the classifier retrained for automation-heavy use, or held constant?
  • Compared to OS-level sandboxing (mentioned in Claude Code Best Practices alongside auto mode), what's the defense-in-depth story? When should both be layered?

Claude Code Best Practices

  • What's the optimal CLAUDE.md length before instructions start getting lost? Is there a measurable threshold?
  • How does the Writer/Reviewer pattern compare to agent-to-agent review (as in OpenAI's Codex workflow)?
  • When does subagent overhead exceed the benefit of context isolation?

Claude Opus 4.7

  • Do Hakim's (2026) brevity-constraint findings on Opus 4.6 replicate on Opus 4.7, or does the literal-instruction-following change the elasticity? Specifically: does <50 words still yield +13.1pp on GSM8K?
  • Does Opus 4.7 still underperform as a planner in HotpotQA-style combo sweeps, or does improved instruction-following close the gap that AgentOpt (Hua et al., 2026) identified?
  • What is the real-world token-inflation multiplier on typical Claude Code sessions (1.0–1.35× is content-dependent — what's the distribution on code-heavy vs. prose-heavy inputs)?
  • How does xhigh compare to max on coding evals? The migration guidance says "start with high or xhigh" — is max ever worth it for coding?
  • What fraction of existing CLAUDE.md / system-prompt hedges become counterproductive under literal instruction following?

Client-Side Agent Optimization

  • How does combination-level optimization interact with continual model releases? If Claude Opus 4.7 ships next month, does the full Pareto frontier need re-running, or do warm-started bandits adapt cheaply?
  • At what pipeline depth does the combinatorial search become intractable even for Arm Elimination? The paper tests up to ~81 combinations; production pipelines with 5+ roles and 10+ candidate models each blow past that.
  • Does the "weak planner + strong solver" pattern generalize, or is it specific to HotpotQA's delegation dynamic? Recommender-critic, drafter-editor, and retriever-generator topologies might invert.
  • What's the right way to re-evaluate when the tool environment changes? AgentOpt assumes fixed tools — adding or removing a tool potentially invalidates the whole frontier.
  • Is there a cheap per-call classifier that can predict which combination will win on a given query, avoiding combo-level evaluation entirely?
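
The scale concern in the second question is simple arithmetic: with one model chosen per role, the number of combinations is the product of the candidate counts. The sketch below assumes independent per-role choices. Reading ~81 as 3 candidates across 4 roles is only one decomposition that produces that number, not something the source states.

```python
from math import prod

def combination_count(candidates_per_role: list[int]) -> int:
    """Number of distinct pipelines when each role picks one model independently."""
    return prod(candidates_per_role)

print(combination_count([3, 3, 3, 3]))  # 81
print(combination_count([10] * 5))      # 100000
```

At 5 roles with 10 candidates each, the space is already about three orders of magnitude larger than the ~81 combinations the paper is said to test.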

Code as Source of Truth

  • What knowledge genuinely can't live in the codebase (org strategy, the "why," cross-team context) and therefore still needs a durable doc — and how do you keep that small slice current?
  • If onboarding is "ask Claude," what happens to the tacit knowledge that was previously transferred socially in deep-dives — is it captured anywhere, or quietly lost?

Codex App Server Protocol

  • How does the App Server protocol compare in detail to MCP? Both expose tools to a model, but App Server is inside the Codex runtime while MCP is outside. When does each win?
  • Is there a public schema registry so external orchestrators can target specific App Server versions without generate-json-schema?
  • The "dynamic tool calls (experimental)" caveat — what's the stability roadmap? Symphony depends on this for its security model.
  • How well does the protocol handle multi-modal turns (image inputs, screenshot attachments)? The spec is text-focused.
  • Is there an analogous protocol on the Claude side, or is Claude's equivalent exclusively the Agent SDK + tool-use API? Comparing the two would clarify when "drive an existing CLI" beats "build on the SDK."

Compounding Data Moat

  • Is the "two-year replication window" claim defensible empirically, or aspirational? The playbook does not cite measurement.
  • How does this moat hold up when foundation models themselves continue improving rapidly? If a generalist model in 2027 has internalized enough vertical context to handle 340B drug claims natively, does the vertical-edge-case moat erode?
  • The data-flywheel argument has been made for SaaS for 15 years. What's actually different in the AI-native version? Probably: the data improves the model in addition to the product, but the playbook doesn't make this distinction precisely.
  • The "customers build APIs on top of you" lock-in is structurally similar to platform plays (Salesforce AppExchange, Shopify apps). Is the moat type really new, or just newly accessible to lean startups?

Compute Allocator

  • Is 1% a Thariq-specific number or a regime? For larger, more code-heavy projects the production residue is presumably higher; what sets the ratio?
  • Allocation quality is hard to measure — what's the feedback loop that tells an allocator they spent compute badly (vs. just spending a lot)?
  • Does treating humans as "compute allocators" risk the oversight-fatigue / accountability failure modes the HBR research flags, where the human nominally decides but actually rubber-stamps?

Context Window Smart Zone

  • Does the smart-zone marker scale with model size, or is it bounded by attention architecture? Pocock observes "the dumb zone has become less dumb lately" but pegs it at 100K through 2026.
  • When sparse-attention or memory-augmented architectures ship, does the smart zone become a soft constraint?
  • How should harnesses surface remaining smart-zone budget to the user — token count, percentage, or a richer signal?
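
One way to picture the options in the last question is a small helper that reports both the raw count and the percentage. The 100K threshold is Pocock's figure quoted above; the function name, the returned fields, and the idea of a single fixed threshold are assumptions for illustration.

```python
SMART_ZONE_TOKENS = 100_000  # figure quoted above; treated here as a fixed constant

def smart_zone_status(tokens_used: int, limit: int = SMART_ZONE_TOKENS) -> dict:
    """Summarize how much of the smart zone remains for the current session."""
    remaining = max(limit - tokens_used, 0)
    return {
        "remaining_tokens": remaining,
        "percent_used": round(100 * min(tokens_used, limit) / limit, 1),
        "in_smart_zone": tokens_used < limit,
    }

print(smart_zone_status(62_500))
```

A "richer signal" would need more than this, for example some measure of how output quality degrades near the boundary, which no source here specifies.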

Cowork

  • How does Cowork's harness compare to Claude Code's? Both surface skills, MCP, sub-agents — but the failure modes for non-code output differ (no test suite, no compiler, no diff to review).
  • What's the eval discipline for Cowork-class outputs? Cat Wu says memory benefits a lot from evals; unclear how slide-deck quality is measured.

Deep Modules for Agents

  • How big is "deep enough"? Pocock's example modules are several hundred LOC; Ousterhout's textbook examples are larger. A sweet spot presumably exists, but it is not articulated.
  • For ports/adapters codebases, does the deep-module advice transfer cleanly? The "small interface" is the port; the "large behavior" is the adapter. Probably yes, but this is not exercised in the source.
  • Refactor cost vs benefit: when is "improve-code-base-architecture" worth running on a working repo?

Design Concept Grilling

  • Can grilling be run AFK against another agent that holds the user's preferences? Pocock's answer in 2026 is "no, this part has to be human-in-the-loop" — but the question is open as agents get better at modeling their principal.
  • How does grilling change for team work where multiple humans need to align? Pocock's hint: pair-program with the agent in the room, treat it as a third interlocutor.

Disposable Micro-Apps

  • Where's the line between a disposable micro-app and tool sprawl? If every edit spawns a bespoke UI, does the workflow fragment?
  • Does the copy-back-to-markdown round-trip generalize beyond config-shaped data (rules, tables) to richer artifacts?
  • Could these micro-apps be templated/reused rather than regenerated — and at what point does that defeat the "disposable" framing and turn into durable tooling?

Dogfooding as Product Discipline

  • Dogfooding works when the team is the user (Claude Code) or near it (Cat Wu, Boris). How do you build product sense for users very unlike you — does "talk to customers" fully substitute, as Glasgow/Fung's small-business work suggests?
  • Can dogfooding scale, or does it implicitly cap how large an AI-native product org can stay taste-driven before it reverts to dashboards?

Engineer PM Convergence

  • Does this scale beyond ~50-person Claude Code-style teams? Boris hedges: "I think this is going to be a question for years."
  • What happens to formal PM career ladders in companies where engineers do PM work? Open at Anthropic per Cat.
  • Cross-disciplinary generalist is a hiring bar — where does the supply come from? Career changers, or new-grad bias toward AI-native education?

Evals as Product Spec

  • How do you write an eval for taste-driven features like character? Amanda's role is canonical for being eval-resistant; Cat names her as someone who is good at evals here, but doesn't describe the technique.
  • The 10-vs-100 number is given without justification. Is there a Goldilocks zone, or does it depend on feature surface area? Client-Side Agent Optimization's framing of combos suggests evals also have a combinatorial explosion problem.
  • How do evals interact with Harness Shrinkage as Models Improve? When a harness asset shrinks because the model now handles it natively, the evals built around the old harness may become artifacts rather than guardrails. Does Anthropic retire evals or repurpose them?
  • Is there a single non-Anthropic example of a PM-as-eval-writer to cite, or is this currently a Cat-Wu-singular framing? The Matt Pocock workshop reaches the same place from a different vocabulary, but no third source has been ingested yet.
  • The LLM-critic fitness is itself an unverified heuristic atop a verified substrate. How often does the Elo ranking mislead the search vs. the cost of computing it?
  • Hyperparameters ($c=0.2$, top-64, $P=7$) were "chosen empirically." How sensitive is the result to them, and do they transfer across mathematical domains?
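
For context on the Elo question above, this is the conventional Elo update applied to a single pairwise judgment. The source does not say how the critic's ratings are actually computed, so the K-factor of 32 and the 400-point scale are the usual chess-style defaults, assumed here purely for illustration.

```python
def elo_update(
    rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise comparison."""
    # Expected score of A under the standard logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

print(elo_update(1500, 1500, a_wins=True))  # (1516.0, 1484.0)
```

Each update trusts the judgment that produced it, so a systematically biased critic shifts the ranking without any signal that something is off. That is one way the ranking could mislead the search.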

Founder as Agent Orchestrator

  • The playbook claims non-technical founders can now build production software, but it does not address the architectural-judgment recursion problem (Agentic Technical Debt): non-technical founders may not have the vocabulary to write effective CLAUDE.md. How does that scale?
  • The "lean 10-person unicorn" is asserted; no quantitative data in the playbook on actual headcount-at-PMF or headcount-at-Series-A medians for AI-native startups vs. the prior cohort.
  • How does the orchestration role change the founder's decision burden? Fewer hands-on tasks but more parallel agent oversight; net cognitive load is unclear and may be higher (see AI Brain Fry).
  • Anthropic publishes both the playbook's anthropomorphic framing and HBR-aware accountability work (auto-mode, alignment) simultaneously without engaging the framing literature directly. The synthesis in Orchestration vs Employee Framing: Reconciling the Founder's Playbook with HBR's Accountability Evidence reconciles the tension at the operational level — orchestration as workflow design preserves accountability; orchestration as mental model of agents-as-coworkers does not — but the open question of why the playbook's marketing language doesn't reflect Anthropic's own framing-discipline work remains.

Founder-Led Sales Discipline

  • Where exactly does "until PMF" end, and what's the first thing a founder should hand off (AE? agent? both)? Glasgow still does it post-Series-B, suggesting the boundary is fuzzy.
  • Does Glasgow's anti-offload stance generalize, or is it specific to high-trust, mission-critical enterprise sales (ERP) where "they're buying you" — would a PLG/SMB motion delegate to agents far earlier?

Google DeepMind

  • DeepMind reports its bespoke systems being caught by simple loops. Does the lab's comparative advantage move from systems to models + verifiers + benchmarks (mathlib, Formal Conjectures)?
  • The paper opens AI-for-math; what's DeepMind's next target domain where a sound verifier exists?

Harness Shrinkage as Models Improve

  • Does all prompt scaffolding eventually migrate into the model, or does some remain — e.g. organization-specific style, security rules, brand voice?
  • The Boris "100 lines" prediction is a year out from May 2026 — testable in 2027.
  • If harness work shrinks, what new work expands to fill it? Cat Wu's bet: PM/product taste, eval-writing, character work.

Hermes Agent

  • The container backend disabling dangerous-command checks is a defensible design but a meaningful security-model shift. What's the empirical track record? Have lockdown failures in popular images (Daytona, nikolaik/python-nodejs) caused incidents?
  • How do bounded memory files (~2,200 chars MEMORY.md) hold up over long-term use? Auto-consolidation is mentioned but not specified — what's the consolidation algorithm and how lossy is it?
  • Hermes's DM-pairing flow is a clean security primitive. Why hasn't this pattern been adopted by Claude Code or Cursor for shared/team deployments?
  • The split between AGENTS.md (project) and SOUL.md (personality) is explicit in Hermes but implicit in Claude Code's CLAUDE.md. Does the split materially improve outcomes, or is it a documentation choice without empirical backing?
  • Cron jobs in fresh sessions with no memory — how do teams structure the "context the agent needs" without it bloating every cron prompt? Is there a standard pattern?

HTML as the New Markdown

  • Does the human-facing harness keep growing without bound, or does it hit its own bloat ceiling (an HTML plan too elaborate to read, like the markdown it replaced)? Answered: Does the Human-Facing Harness (HTML Artifacts) Hit Its Own Bloat Ceiling? — yes; HTML raises and reshapes the human-attention ceiling but can't remove it, and the bloat relocates from document-length to artifact-sprawl/rubber-stamping.
  • HTML is heavier to diff and version than markdown — what happens to plan history and review when artifacts are single-file websites? (Disposable Micro-Apps copy-back-to-markdown is one patch.)
  • Does this generalize past one expert practitioner, or does it require Thariq-level fluency with Claude to be worth the overhead?

Interaction Models

  • Does the interaction/background split generalize, or is it a transitional artifact until a single model is both fast and deep enough?
  • "Interactivity scales with intelligence" is asserted; the larger-model release later in 2026 is the test.
  • Research grant announced for interactivity benchmarks — what becomes the FD-bench equivalent for video proactivity?

Jagged Intelligence (Ghosts, Not Animals)

  • Karpathy concedes the framing may not have "real power." Is "ghost vs. animal" load-bearing, or a useful intuition pump that doesn't change concrete decisions?
  • If taste/aesthetics/simplicity entered the RL mix, would jaggedness in those dimensions smooth out — or are they too unverifiable to reward cleanly (cf. The Verifiability Thesis)?

Lean

  • mathlib maturity gates the reachable frontier. Can AI formal proof search grow mathlib (formalize new theory) as a byproduct, expanding its own frontier?
  • Lean is a perfect verifier for math. Which other domains have a comparably sound automatic verifier (vs. only noisy ones like tests or LLM-judge councils)?

Living Design System

  • How does the design_system.html stay in sync as the codebase evolves — re-extract on a cadence, or wire it into CI?
  • Does a rendered, model-readable design system measurably improve on-brand output vs. a plain CSS/token file, or is the win mostly human legibility?
  • At what project size does maintaining the artifact cost more than the consistency it buys?

LLM-as-Compiler Knowledge Base

  • At what scale does the no-vector-database approach break down? Karpathy's ~100 articles fit in context, but what about 1,000+?
  • How should conflicting information across sources be handled during compilation?
  • What's the optimal granularity for concept articles — one concept per article, or clustered by theme?
  • How effective is the synthetic training data → fine-tuning pipeline in practice?

LLM-Driven Vulnerability Research

  • How do these capabilities transfer to non-memory-safety bug classes (logic bugs, protocol-level flaws, supply chain attacks)?
  • What's the ceiling for autonomous exploit complexity? The N-day examples are remarkably sophisticated — is there a qualitative limit?
  • How will the security industry's equilibrium shift when multiple labs have Mythos-class models?
  • Can defensive scaffolds (continuous fuzzing + model-driven triage + auto-patching) close the attacker-defender gap during the transition?
  • What safeguards are effective against Mythos-class outputs without crippling legitimate security research?

Managers as ICs

  • Fung's own open question: "Do you still need separate iOS and Android orgs?" — if engineers flex across platforms via Claude, the traditional platform-split org may dissolve too. How far does flattening go?
  • Does manager-as-IC scale past a certain org size, or only work while Claude Code is small and the codebase is Claude-legible?

MCP and Computer Use

  • The MCP ecosystem's growth rate vs. computer use's quality curve: at what point does computer use become good enough that the marginal value of building an MCP server drops? Boris implies this is years off but doesn't quantify.
  • Is computer use a sustainable interface or a transition technology? If most knowledge-work software adds MCP support in the next 24 months, computer use's role shrinks to legacy/desktop-only systems.
  • MCP security model: as the playbook prescribes wiring MCP into Salesforce, Gmail, Calendar for solo founders, the attack surface scales with adoption. Not discussed in any source ingested.
  • How does Cowork's computer-use guardrail compare to Claude Code's auto-mode classifier? Different deployment context, possibly different risk profile.

Model Introspection Feedback

  • How reliable are 4.7-class introspective reports? Anthropic's interpretability research suggests partial fidelity but not full. Empirically, Cat reports it's good enough to drive harness fixes — but unclear at what model scale this technique becomes load-bearing.
  • Does adversarial introspection ("why did you fail?") yield different signal than neutral ("walk me through your reasoning")? Worth probing.
  • Could a meta-agent run introspection automatically against logged failures? Sounds tractable but no public implementation.

Model Spec Science

  • Does Model Spec science transfer across base models or families? Paper only tests Qwen.
  • Does it survive RL post-training pressure?
  • Can a sufficiently rich General Spec match a Specific Spec? Authors think yes, no demonstration yet.
  • Interaction with situational awareness — if models learn the spec is being used to train them, does that change how MSM-installed values express?
  • How does this interact with Claude character — is the warm/curious personality also subject to spec-science optimization?

Mythos Model

  • Public release timeline: not in source.
  • Capability profile beyond cybersecurity: Mythos Preview focused on the safety story; other capability dimensions not well-documented externally.
  • Internal access controls: who at Anthropic actually uses Mythos for daily work, vs Opus 4.7? Boris implies infrequent (try-it use); not detailed.

Narrow Wedge into a Legacy Market

  • A wedge works going in; does it constrain going out? Campfire now serves public companies — at what point does "narrow-but-best" require becoming the broad incumbent it displaced, re-incurring NetSuite's complexity?
  • The wedge-flip shows the first wedge can be wrong. What's the fastest signal that a wedge converts to the core vs. merely sells — Campfire took ~3 months; can it be read sooner?

Outsource Your Thinking, Not Your Understanding

  • Karpathy's open frontier: can "understanding" itself eventually be automated, or is it definitionally the human residue? His "back in a couple years" hedge leaves it open.
  • If understanding is the bottleneck, is the highest-ROI skill learning how to build understanding fast (knowledge-base hygiene, asking the right projections) — and can that be taught?

Printing Press Software Democratization

  • Is domain-expert-as-builder actually happening at scale in 2026? Anecdotes (shop owners, microcontroller hobbyists) yes; primary-job software building by non-engineers, less clear.
  • What's the equivalent of compulsory schooling for universal coding literacy? Or does that not happen and we get a long tail of self-taught builders?
  • Boris's "accountant writes accounting software" — does that result in 10K narrow tools that don't interoperate? What's the integration story?

Problem-Solution Fit Discipline

  • Does asking an AI to argue against an idea actually produce disconfirming evidence at the same rigor as confirming evidence, or does the model still bias toward the framing the founder presents? Worth measuring.
  • The playbook recommends "ask Claude to make the most compelling argument for why a competitor would succeed while you do not." How does this interact with Anthropic's published character training (sycophancy resistance, devil's-advocate willingness)?
  • Has anyone measured 2026 startup failure rates with AI-built products? The "42% will climb" claim is asserted without measurement.

Product Velocity as Moat

  • Velocity-as-moat is a treadmill: it evaporates the moment a competitor matches pace. What converts Campfire's velocity lead into a structural moat before the AI-native cohort's pace converges?
  • "Never had anyone outgrow Campfire" — is that survivorship (they haven't hit true enterprise scale yet) or a real claim that velocity closes the breadth gap faster than customers grow into it?

Scale-Dependent Prompt Sensitivity

  • Does the RLHF length-bias hypothesis replicate when tested against base (non-instruct) model variants directly? If verbose generation were primarily pretrained, base-model verbosity differences should match instruct-model differences.
  • What problem characteristics predict prompt sensitivity? An automated classifier would make scale-specific prompting deployable.
  • How does the overthinking effect interact with tool-using agents? If brevity helps large models but tools require structured reasoning, the optimal prompt is not uniformly brief.
  • Do reasoning models (o1, DeepSeek-R1 style) exhibit different overthinking dynamics than instruct models? Their trained behavior is explicitly to generate long CoT — does brevity intervention hurt them?
  • Is BoolQ's functional-elaboration exception a clean taxonomy boundary, or does every task type have a context-dependent optimal length?

Seven Powers Applied to AI

  • Is "switching cost" really collapsing in practice, or just in narrative? Anthropic's own retention numbers, Salesforce churn, etc. would test this.
  • What does Boris's "cornered resource" look like for foundation-model labs that are themselves trying to commoditize? Internal contradiction or transient phase?
  • Counter-positioning — explicitly the "incumbent can't follow" power — should amplify under AI. Is anyone running this play deliberately?

Software 3.0

  • Where is the line between "the app shouldn't exist" (MenuGen) and apps that should — i.e., when is deterministic 1.0/2.0 scaffolding still the right call vs. spurious?
  • The neural-net-as-host-process flip is presented as plausible-but-TBD. What would the first production system that genuinely inverts the CPU/NN relationship look like?

Symphony

  • The 500% landed-PRs claim is hedged — no baseline definition, "on some teams" only. What does the distribution look like across teams? What happens to PR quality and revert rate at that throughput?
  • "Workspaces preserved across runs" is the opposite of typical CI ephemerality. At what point does state pollution from prior runs (stale node_modules, leftover branches, build artifacts) start hurting more than warm-cache helps?
  • Symphony doesn't write to the tracker — agents do. This means tracker policy is a prompt in WORKFLOW.md. How brittle is this in practice when Linear changes its API? How is consistent state-machine behavior enforced when agents have prompt-level discretion?
  • The spec was simplified by being implemented in 6 languages. What's the extension of this technique? Could compiler-prompt.md in this vault be similarly cross-fuzzed?
  • Symphony explicitly says agents can self-create tickets. What governance prevents runaway ticket-graph expansion? Is human triage of agent-created tickets the only check?

Ticket-Driven Agent Orchestration

  • What's the right granularity for ticket size when the unit is "what one agent does in one workspace"? The post implies "much larger units of work" become viable, but how does that interact with the agent.max_turns limit (default 20)?
  • How do you prevent a ticket-extension cascade when agents file follow-up tickets liberally? Is the only governance check human triage at the Todo-state queue?
  • Does this pattern generalize to non-software work (research, ops, content)? The DAG dependency model and prompt-as-policy file should transfer; the per-issue workspace doesn't obviously.
  • When an agent gets a ticket "completely wrong" (mentioned in the post), how is the lesson fed back into the system? Symphony's answer is "add guardrails and skills" — what's the institutional process for that?
  • How does ticket-driven orchestration interact with sprint planning / OKRs / roadmap work that operates on aggregates of tickets? Does the abstraction collapse when tickets are scoped that small?

The Verifiability Thesis

  • Where's the boundary of "council of LLM judges" reliability — does it hold for genuinely contested value judgments, or only for quality/coherence?
  • The "labs care" dependency is fragile: capabilities can appear or stagnate based on lab priorities you don't control. How should a product hedge against the data-distribution rug-pull?

Verification as the New Bottleneck

  • Fung's own open question: "How far do you push fully automated reviews?" — where's the speed/safety balance, and how do you keep humans confident without re-introducing the review bottleneck?
  • If CI/build is the hidden jam, does verification infrastructure (test runners, CI capacity) become the actual capex of an AI-native org?

Vertical Slice Tracer Bullets

  • Can the planner agent be trusted to slice vertically once told to, or does it need a verifier that flags horizontal slices? Pocock's experience: it needs the verifier, at least through 4.7.
  • How should slice granularity be tuned? Too thin = many merge conflicts; too thick = back to horizontal.

Vibe Coding vs. Agentic Engineering

  • Karpathy hints at "one domain that's very [valuable]" for founders but won't say which (didn't want to "vague-post on stage"). What verifiable RL-environment domain is he gesturing at?
  • If the mediocre/AI-native spread keeps widening, what does that do to team composition — a few extreme outliers plus agents, vs. broad mid-level staffing?

Zero-Friction Scope Creep

  • The playbook recommends written scope but offers no template or worked example. How specific does "what we deliberately don't do" need to be to actually block requests?
  • Is there a measurable threshold where scope creep crosses into outright pivot territory? The playbook gestures at "losing direction" without a metric.
  • How does this interact with Cat Wu's 1-day shipping cadence? Anthropic's internal practice ships fast but with strong product judgment; how does that judgment translate for a first-time founder?
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
