Howardism · Vol. 03Plate II · No. 02

Product & Org, in order.

Notes11DomainProduct & OrgOpen Qs31Newest3 Jul 2026Oldest6 May 2026

Product cadence, org design, and the AI-native team.

Map of Content for the product-org domain — 11 concepts. Curated entry point; see Home for all domains.

AI Native Product Cadence (hub) — Cat Wu's 6mo→1mo→1day cadence at Anthropic: research-preview branding, mission-as-tiebreaker, evergreen launch room, lighter PRDs, weekly metrics readouts
Compounding Loop Optimization — Dan Carey's discipline of instrumenting and automating every recurring step of the build loop — because when internal tooling is an-afternoon-cheap, each optimization pays back ×(50–100 iterations per project)
Dogfooding as Product Discipline — Product sense is built by relentless first-hand use ("ant food"); Mr. Peanut catch; cross-source (Cat Wu vibe-checks, Glasgow founder-led sales)
Engineer PM Convergence — Generalists across disciplines; product taste as bottleneck skill; Anthropic Claude Code team as case study; "just do things" cultural substrate
Evals as Product Spec — Cat Wu's framing of evals as the emerging core PM skill: ten great evals beats a hundred mediocre; encode what done looks like for ambiguous AI features; companion to introspection (hypothesis) and vibe-check (direction)
Implementation Abundance Inverts Product Work — Andrew Ambrosino's inversion thesis: when talking to a frontier model can stand up any feature from scratch, implementation stops being the expensive step you derisk up front — so the process runs backwards and the costly work becomes curating the 90 uncoordinated builds people already produced; taste is the new bottleneck
Managers as ICs — Every Claude Code manager starts as an IC; flat org; agentic coding collapsed the onboarding cost that pushed managers out of the codebase
Model Introspection Feedback — Cat Wu's underrated technique: ask the model why it failed; treat answer as harness-debugging signal not model criticism; caveats around model self-report fidelity
Polish No Longer Signals Readiness — Andrew Ambrosino's observation that the medium used to encode process-stage — a production-looking artifact meant late-stage, derisked, design-and-business-approved — but cheap implementation divorces polish from maturity: a 90-person exploration can look ready-to-ship while being early design work, and over-anchoring on it ('can we release this now?') is the trap
Prototype Over PRD — Dan Carey's prototype-replaces-PRD method: record a why-not-what conversation, transcribe it, hand the transcript to Claude, ask for a few prototype variations; the prototype is the spec, not a downstream artifact
Role Averaging, Not Role Elimination — Andrew Ambrosino's nuanced OpenAI-side take on role collapse: your role is 'the average of what you spend your time on' and tool-gatekeeping is eroding — but eliminating roles dangerously eliminates specialties with knowable best practices ('getting rid of the product role is a terrible idea'), and 'zone defense' coverage plus managers remain necessary because not everyone can work on everything in both breadth and depth

Open questions 31 open

AI Native Product Cadence
- Does the cadence scale beyond ~100 people? Anthropic itself is bigger (~30-40 PMs alone), but the [[claude-code]] team that visibly drives cadence is small.
- What's the equivalent of research-preview branding for B2B enterprise launches where customers expect stability? Cat doesn't address.
- How much of the cadence is structural (process choices) vs cultural (talent density)? Probably both, ratio unclear.
Compounding Loop Optimization
- The loop assumes the team *is* (close to) the user. How much of the compounding advantage survives when the user is unlike the builder and "talk to users" can't be same-room?
- Where is the line between worthwhile internal tooling and yak-shaving? Carey's "afternoon" bar is the heuristic, but [[cat-wu]] warns that over-customizing setups "becomes distraction."
- Does Claude-as-first-pass-on-all-feedback ever filter out the rare signal that doesn't cluster? Automating triage optimizes the common case; the tail is where surprising bets come from.
Dogfooding as Product Discipline
- Dogfooding works when the team *is* the user (Claude Code) or near it (Cat Wu, Boris). How do you build product sense for users very unlike you — does "talk to customers" fully substitute, as Glasgow/Fung's small-business work suggests?
- Can dogfooding scale, or does it implicitly cap how large an AI-native product org can stay taste-driven before it reverts to dashboards?
Engineer PM Convergence
- Does this scale beyond ~50-person Claude Code-style teams? Boris hedges: "I think this is going to be a question for years."
- What happens to formal PM career ladders in companies where engineers do PM work? Open at Anthropic per Cat.
- Cross-disciplinary generalist is a hiring bar — where does the supply come from? Career changers, or new-grad bias toward AI-native education?
Evals as Product Spec
- How do you write an eval for *taste*-driven features like [[claude-character-as-product|character]]? Amanda's role is canonical for being eval-resistant; Cat names her as someone who *is* good at evals here, but doesn't describe the technique. **Partially answered:** [[wiki/derived/evals-for-taste-and-character]] — the technique is a pipeline (conviction → dogfood-sourced failure modes → MSM-style variant A/B measurement → ~10 interpretable evals); proven on the safety/values core but still tacit on the warm/witty aesthetic surface.
- The 10-vs-100 number is given without justification. Is there a Goldilocks zone, or does it depend on feature surface area? [[client-side-agent-optimization]]'s framing of combos suggests evals also have a combinatorial explosion problem.
- How do evals interact with [[harness-shrinkage-as-models-improve]]? When a harness asset shrinks because the model now handles it natively, the evals built around the old harness may become artifacts rather than guardrails. Does Anthropic retire evals or repurpose them?
- Is there a single non-Anthropic example of a PM-as-eval-writer to cite, or is this currently a Cat-Wu-singular framing? The Matt Pocock workshop reaches the same place from a different vocabulary, but no third source has been ingested yet. **Partially answered (with a twist):** Google's [[agent-quality-flywheel]] is a third-party arrival at eval-as-the-quality-surface — but its answer is to have the *coding agent* author the eval, compressing the human role to stating the worry and approving the plan.
Implementation Abundance Inverts Product Work
- Curation of 90 uncoordinated builds is itself expensive and doesn't obviously scale — is there a point where the cost of curating parallel exploration exceeds the cost it replaced? ([[role-averaging-not-role-elimination|"zone defense"]] is Ambrosino's partial answer.)
- If taste is the bottleneck and [[research-taste-as-human-bottleneck|taste is "just another capability" AI eventually masters]], does the inversion invert again — does curation migrate into the model?
- The 90-uncoordinated-builds picture assumes abundant tokens and an agentic culture; how much of the inversion survives outside a frontier lab that gives everyone "unlimited tokens"?
Managers as ICs
- Fung's own open question: "Do you still need separate iOS and Android orgs?" — if engineers flex across platforms via Claude, the traditional platform-split org may dissolve too. How far does flattening go?
- Does manager-as-IC scale past a certain org size, or only work while Claude Code is small and the codebase is Claude-legible?
Model Introspection Feedback
- How reliable are 4.7-class introspective reports? Anthropic's interpretability research suggests partial fidelity but not full. Empirically, Cat reports it's good enough to drive harness fixes — but unclear at what model scale this technique becomes load-bearing. **Partially answered:** [[self-report-as-safety-signal]] — reliability is *context-dependent*. In a benign debugging setting the report is good enough to drive harness fixes; in an *adversarial safety* setting, open-weight models (3B–70B) fail to recognize their own compromised outputs 27.3% of the time, and the recognition that exists is the refusal circuit firing late rather than genuine own-output introspection. So the channel Cat relies on is a weak safety signal even where it is a useful debugging one.
- Does adversarial introspection ("why did you fail?") yield different signal than neutral ("walk me through your reasoning")? Worth probing. **Partially answered:** [[self-report-as-safety-signal]] finds self-attribution is *heavily framing-dependent* — an "intention" probe and a "tampering" probe elicit qualitatively different answers on the same models (some families deny tampering ~100% of the time regardless), so the phrasing of the introspective question materially changes the signal.
- Could a meta-agent run introspection automatically against logged failures? Sounds tractable but no public implementation.
Polish No Longer Signals Readiness
- If the medium no longer signals stage, what *does* — is explicit human labeling ("this is exploration") the only mechanism, or can tooling re-attach the signal (e.g. a visible "exploration / preview / prod" marker on every build)?
- Does over-anchoring get worse as builds get more polished, or does everyone eventually recalibrate and learn to discount fidelity entirely?
Prototype Over PRD
- Where does prototype-over-PRD break down? Carey's domain is a visual design tool where a prototype *is* the product surface; for backend/infra/data work the prototype may not capture the spec (cf. [[ai-native-product-cadence]]'s "full PRD for heavy-infra features").
- If there is no PRD, where does the *rationale* ("why we chose variation B") live for future readers? Same rationale-capture gap flagged in [[building-is-cheap-arguing-is-expensive]].
- The prototype-as-spec must not become the prototype-as-validation trap [[problem-solution-fit-discipline]] warns about: a fast prototype proves the build was solvable, not that the problem is real.
Role Averaging, Not Role Elimination
- Where is the equilibrium between fluidity and specialty — how much role-averaging before a company loses the accumulated best practices Ambrosino warns about?
- Zone defense assumes enough high-taste people to cover the whole company; does it degrade in orgs without OpenAI's talent density, collapsing back to top-down planning?
- Does "your role is the average of what you spend time on" survive performance review and career ladders, or does it fragment them the way [[engineer-pm-convergence|Cat Wu flags]] ("we're sacrificing product consistency")?