Howardism

Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical architecture enforcement, agent code review

Sources

Summary

Agent harness engineering is the discipline of designing environments, artifacts, and feedback loops that enable AI coding agents to do reliable, sustained work across multiple context windows. The core shift: the engineer's job moves from writing code to building the scaffolding that makes agents effective — specifying intent, structuring context, enforcing invariants, and constructing verification pipelines.

Details

The Fundamental Problem

AI coding agents work in discrete sessions with limited context windows. Each new session starts with no memory of prior work. Without deliberate harness design, agents exhibit predictable failure modes:

One-shotting — attempting to build everything at once, exhausting context mid-implementation, leaving half-finished undocumented work
Premature victory — seeing partial progress and declaring the job done
Dirty state — leaving the environment with bugs, uncommitted changes, or undocumented progress for the next session to untangle
Incomplete verification — marking features as complete without end-to-end testing

Two-Agent Architecture (Anthropic)

Anthropic's solution for the Claude Agent SDK uses two specialized prompts:

Initializer agent (first session only): scaffolds the environment — writes an init.sh script, creates a claude-progress.txt log, generates a structured JSON feature list with all requirements marked as "failing," and makes an initial git commit.
Coding agent (every subsequent session): reads progress logs and git history, runs a basic smoke test, picks a single feature to implement, verifies it end-to-end (e.g., via Puppeteer MCP for web apps), commits clean state, and updates the progress file.

The JSON feature list is critical: agents are instructed never to remove or edit feature descriptions, only to flip a feature's passes field from false to true after verification. JSON was chosen over Markdown because agents are less likely to accidentally overwrite structured JSON.
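The "only flip passes" rule can also be checked mechanically rather than left to the prompt. The sketch below is one possible guard, not Anthropic's implementation: it compares the feature list before and after a session and reports any change other than a false-to-true flip. The `passes` field comes from the description above; the `id` and `description` field names and the function name are assumptions made for illustration.

```python
def validate_feature_update(old: list[dict], new: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the update is allowed."""
    if len(old) != len(new):
        # Features may never be added or removed by the coding agent.
        return ["feature count changed"]
    errors = []
    for before, after in zip(old, new):
        # Every field except `passes` must be unchanged (field names are hypothetical).
        frozen_before = {k: v for k, v in before.items() if k != "passes"}
        frozen_after = {k: v for k, v in after.items() if k != "passes"}
        if frozen_before != frozen_after:
            errors.append(f"feature {before.get('id')!r}: description or metadata edited")
        # `passes` may go false -> true, but never back.
        if before.get("passes") is True and after.get("passes") is not True:
            errors.append(f"feature {before.get('id')!r}: passes reverted")
    return errors


old = [{"id": "login", "description": "User can log in", "passes": False}]
flipped = [{"id": "login", "description": "User can log in", "passes": True}]
edited = [{"id": "login", "description": "Login works", "passes": True}]
print(validate_feature_update(old, flipped))
print(validate_feature_update(old, edited))
```

A check like this could run as a pre-commit hook or at the start of the next session, so that a violation surfaces before later work builds on it.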

Repository as System of Record (OpenAI)

OpenAI's Codex team built a product with zero manually written code (~1M lines, ~1,500 PRs, 3–7 engineers over 5 months). Their key architectural insight: repository-local, versioned artifacts are all the agent can see; anything in Slack, Google Docs, or people's heads is invisible to it.

Their approach:

AGENTS.md as table of contents, not encyclopedia: a short (~100 lines) map pointing to deeper docs. A monolithic instruction file fails because it crowds out task context, becomes non-guidance when everything is "important," rots instantly, and resists mechanical verification.
Progressive disclosure: agents start with a small stable entry point and are taught where to drill deeper. Design docs, execution plans, and technical debt are all versioned in-repo.
Mechanical enforcement: custom linters (themselves agent-generated) enforce architecture — dependency direction between layers, structured logging, naming conventions, file size limits. Lint error messages are written as remediation instructions injected into agent context.
Doc gardening: a recurring background agent scans for stale documentation and opens fix-up PRs.
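To make the mechanical-enforcement item concrete, here is a minimal sketch of a lint whose error message doubles as a remediation instruction. It is not OpenAI's linter: the 400-line limit, the `.py` glob, the function name, and the message wording are all assumptions chosen for illustration.

```python
from pathlib import Path

MAX_LINES = 400  # hypothetical limit; the source does not state a number


def lint_file_size(root: Path, max_lines: int = MAX_LINES) -> list[str]:
    """Return one remediation message per oversized Python file under root."""
    messages = []
    for path in sorted(root.rglob("*.py")):
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > max_lines:
            # The message tells the agent what to do next, not only what is wrong.
            messages.append(
                f"{path.relative_to(root)}: {line_count} lines exceeds the "
                f"{max_lines}-line limit. Fix: split this file into smaller "
                "modules grouped by responsibility, then re-run the linter."
            )
    return messages
```

Because the output is plain text, a harness can paste it directly into the agent's context after a failed check.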

Enforcing Architecture at Scale

Both sources converge on a key principle: enforce invariants, not implementations. Define strict boundaries (layer dependencies, data validation at boundaries, naming conventions) and let agents have freedom within those boundaries.

OpenAI uses a rigid layered architecture per business domain: Types → Config → Repo → Service → Runtime → UI, with cross-cutting concerns entering through a single Providers interface. This is enforced by structural tests and custom linters. They note that this level of architectural rigor is usually postponed until a team has hundreds of engineers; with agents, it is an early prerequisite because constraints enable speed without drift.
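A structural test for a layer ordering like this can be quite small. The sketch below assumes one reading of the arrows, namely that a layer may import only from itself or from layers to its left; the source does not spell out the direction, so treat that rule, the lowercase layer names, and the input format as assumptions.

```python
# Layer order taken from the text; lowercase names are an illustrative choice.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def check_dependency_direction(imports: dict[str, set[str]]) -> list[str]:
    """imports maps each layer to the set of layers it imports from."""
    violations = []
    for layer, deps in sorted(imports.items()):
        for dep in sorted(deps):
            # Assumed rule: a layer must not reach into a layer to its right.
            if RANK[dep] > RANK[layer]:
                violations.append(
                    f"{layer} must not import {dep}: move the shared code "
                    f"into {layer} or an earlier layer."
                )
    return violations


print(check_dependency_direction({"service": {"repo", "types"}, "repo": {"ui"}}))
```

In a real repository the `imports` mapping would be derived from the source tree (for example by parsing import statements); it is passed in directly here to keep the sketch self-contained.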

Continuous Entropy Management

Agent-generated codebases accumulate entropy: agents replicate patterns that already exist, including suboptimal ones. OpenAI initially spent 20% of engineering time on manual "AI slop" cleanup. Their solution: encode "golden principles" into the repo and run background agent tasks on a recurring cadence to scan for deviations, update quality grades, and open targeted refactoring PRs. This functions as garbage collection — paying down technical debt continuously in small increments rather than letting it compound.
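The recurring scan can be pictured as a small report generator. The sketch below is purely illustrative: the two "golden principles", the A/B/C grading scale, and all function names are invented for the example and are not taken from OpenAI's setup.

```python
import re

# Hypothetical golden principles, each expressed as a pattern to avoid.
PRINCIPLES = {
    "no bare except": re.compile(r"^\s*except\s*:", re.MULTILINE),
    "no print-based logging": re.compile(r"^\s*print\(", re.MULTILINE),
}


def scan(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each file name to the principles its text violates."""
    return {
        name: [rule for rule, pattern in PRINCIPLES.items() if pattern.search(text)]
        for name, text in sources.items()
    }


def grade(violations: list[str]) -> str:
    """Toy quality grade: A for clean, B for one violation, C for more."""
    return "A" if not violations else "B" if len(violations) == 1 else "C"


def refactor_queue(sources: dict[str, str]) -> list[str]:
    """Files with violations, worst first, as candidates for small refactoring PRs."""
    report = scan(sources)
    dirty = [name for name, found in report.items() if found]
    return sorted(dirty, key=lambda name: (-len(report[name]), name))
```

Run on a schedule, the queue gives a background agent a bounded list of small cleanups instead of one large periodic rewrite.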

Harness as Service

The patterns above describe per-session harnesses. Two 2026 systems demonstrate the natural evolution — harnesses that run continuously as services with per-tenant workspace isolation:

Symphony (OpenAI, March 2026) — long-running daemon polling Linear, per-issue workspace, Codex App Server session per ticket. Same team that authored the OpenAI source above; their explicit "harness as service" iteration. Orchestrator owns workspace lifecycle, retry/backoff, stall detection, and reconciliation; in-repo WORKFLOW.md is the policy file.
Hermes Agent (Nous Research) — Hermes Gateway runs as systemd or launchd; per-user session isolation; allowlist + DM-pairing authorization; cron jobs delivered to a designated home channel.
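Two of the orchestrator duties named above, retry/backoff and stall detection, reduce to simple arithmetic over timestamps. The sketch below is a generic illustration, not Symphony's code: the `Session` fields, the default timeouts, and the function names are assumptions.

```python
from dataclasses import dataclass


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff without jitter: base * 2**attempt seconds, capped."""
    return min(cap, base * (2 ** attempt))


@dataclass
class Session:
    issue_id: str         # hypothetical: one session per tracked issue
    last_event_at: float  # timestamp (seconds) of the most recent agent event


def find_stalled(sessions: list[Session], now: float,
                 stall_timeout: float = 600.0) -> list[str]:
    """Return issue ids whose sessions have been silent longer than the timeout."""
    return [s.issue_id for s in sessions if now - s.last_event_at > stall_timeout]


sessions = [Session("ISSUE-1", last_event_at=0.0), Session("ISSUE-2", last_event_at=900.0)]
print(find_stalled(sessions, now=1000.0))
print([backoff_delay(n) for n in range(4)])
```

A production orchestrator would typically add jitter to the delay and persist attempt counts somewhere that survives a restart.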

Convergent design choices across the two:

Daemon-first deployment — long-running service, not per-invocation CLI.
Per-tenant workspace isolation — per-issue (Symphony) or per-user (Hermes).
Container backends as the trust boundary (Docker, Singularity, Modal, Daytona) rather than per-command approval prompting. Hermes explicitly disables dangerous-command checks under a container backend on the principle that "the container is the security boundary."
Repo-versioned markdown as control plane — WORKFLOW.md for Symphony, AGENTS.md/SOUL.md for Hermes, same pattern as the CLAUDE.md/AGENTS.md table-of-contents discipline at session level.
No durable orchestrator DB by default — Symphony explicitly chooses tracker + filesystem for restart recovery; Hermes Gateway state is filesystem-only.

Symphony's evolution sharpens the principle stated above. Their first version treated agents as rigid state-machine nodes — Codex was only asked to implement the task in a ticket. They found this too limiting once models grew capable enough to "create multiple PRs as well as read review feedback and address it," and shifted to giving agents objectives + tools, not state transitions. This is "enforce invariants, not implementations" applied at the orchestration layer (see Ticket-Driven Agent Orchestration).

For the integration boundary between orchestrator and coding agent, Symphony exercises the Codex App Server protocol — JSON-RPC over stdio with continuation turns and dynamic tool calls — which makes the contract explicit and version-tolerant. The protocol's dynamic tool calls feature is also a notable harness primitive: orchestrator-implemented tools can wrap credentials the subagent should never see (e.g., Symphony's linear_graphql tool proxies authenticated GraphQL without giving subagent containers the Linear access token).
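The credential-wrapping idea can be sketched with plain JSON-RPC 2.0 messages. This is not the Codex App Server protocol: the `tool/call` method name, the parameter layout, the `Bearer` header format, and the injected `send` transport are all assumptions made so the example runs without a network. Only the `linear_graphql` tool name comes from the text above.

```python
import json
from typing import Callable


def make_tool_handler(token: str,
                      send: Callable[[str, dict], dict]) -> Callable[[str], str]:
    """Build an orchestrator-side handler for a hypothetical `tool/call` request.

    The token lives only in this closure; the subagent sees the result, never the secret.
    """
    def handle(raw_request: str) -> str:
        request = json.loads(raw_request)
        params = request.get("params", {})
        if request.get("method") != "tool/call" or params.get("name") != "linear_graphql":
            # -32601 is the standard JSON-RPC 2.0 "Method not found" code.
            return json.dumps({"jsonrpc": "2.0", "id": request.get("id"),
                               "error": {"code": -32601, "message": "Method not found"}})
        # Credentials are attached here, on the orchestrator side only (header format assumed).
        headers = {"Authorization": f"Bearer {token}"}
        result = send(params["arguments"]["query"], headers)
        return json.dumps({"jsonrpc": "2.0", "id": request["id"], "result": result})

    return handle
```

In use, `send` would wrap a real HTTP client, and the handler would be wired to whatever channel carries requests from the subagent; the subagent container itself never needs the token in its environment.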

The Role of the Human

In both systems, humans work at a different abstraction layer:

Prioritize work and translate user feedback into acceptance criteria
Design environments and feedback loops
Validate outcomes and provide taste/judgment
When the agent struggles, diagnose what capability is missing (tools, guardrails, documentation) and feed it back into the system — always via the agent, not by writing code directly

Connections

LLM-as-Compiler Knowledge Base — shares the pattern of repository-local knowledge as system of record, incremental compilation, and LLM-maintained artifacts
Claude Code Best Practices — practical application of many harness engineering principles in Claude Code's environment (CLAUDE.md, skills, hooks, subagents)
LLM-Driven Vulnerability Research — the vulnerability-finding scaffold is a minimal harness: isolated container, single prompt, agentic loop with file-ranking pre-pass and validation agent
Client-Side Agent Optimization — harnesses provide the execution substrate that client-side optimizers then tune via combo selection; the invariants a harness enforces constrain the space AgentOpt searches over
Scale-Dependent Prompt Sensitivity — output-length invariants (via system prompts, schemas, or validators) are a harness-level mitigation for scale-dependent overthinking — fits the "enforce invariants, not implementations" principle
Claude Code Auto Mode — classifier-based tool-call gating is a concrete instance of "enforce invariants mechanically" at the permissions boundary — destructive-action limits enforced pre-execution rather than via advisory prompt
Claude Opus 4.7 — better filesystem-memory reinforces the case for repository-local versioned artifacts as agent memory; task budgets echo the discipline of explicit resource envelopes that harnesses already impose
Symphony — the natural "harness as service" evolution from the same OpenAI team; ticket-as-unit and per-issue workspace are direct extensions of the harness patterns established here
Ticket-Driven Agent Orchestration — orchestration-layer restatement of "enforce invariants, not implementations"; once the per-session harness works, the next bottleneck is which session runs next
Codex App Server Protocol — the integration boundary that makes "harness as service" possible; orchestrator drives sessions through a versioned JSON-RPC contract instead of scraping a CLI
Hermes Agent — parallel daemon-first agent ecosystem; per-user instead of per-issue isolation, with the same container-backend safety pattern; bounded MEMORY.md/USER.md files implement explicit memory envelopes
Context Window Smart Zone — the underlying constraint motivating system-prompt minimalism, AGENTS.md-as-ToC, and reviewer-in-fresh-context discipline; quadratic attention scaling sets the budget every harness operates within
Agent Loop Pattern — the natural session-level primitive once the per-session harness works: drain a Kanban backlog AFK, fragment work into many fresh-context iterations
Vertical Slice Tracer Bullets — planning-layer restatement of "enforce invariants, not implementations" — invariant is "every slice produces visible feedback"
Design Concept Grilling — alignment-layer harness primitive; prevents premature plan generation by forcing pre-plan interview
Deep Modules for Agents — codebase-shape complement: agents in deep-module codebases conserve smart-zone tokens and have natural test boundaries
Harness Shrinkage as Models Improve — division-of-labor between harness and model: prompt scaffolding shrinks with model improvements, mechanical verification stays load-bearing
Model Introspection Feedback — debugging-time tool: ask the model why it failed, fix the harness, not the model
Interaction Models — resolves the harness-vs-model question firmly toward the model for the interaction layer (real-time A/V); VAD / turn-detection / dialog-management harnesses dissolve into model behavior (Thinking Machines Lab, May 2026)
The Bitter Lesson — the principle behind "enforce invariants, not implementations": don't hand-engineer what scaled general capability will subsume

Derived

Opus 4.6 → 4.7 Changes and Multi-Agent Coding Considerations — applies "enforce invariants, not implementations" and the Writer/Reviewer pattern to an Opus 4.7 multi-agent coding team

Open Questions

Does a single general-purpose coding agent outperform a multi-agent architecture with specialized testing, QA, and cleanup agents?
How does architectural coherence evolve over years in a fully agent-generated system?
At what codebase scale does the AGENTS.md-as-table-of-contents approach need to be replaced with more sophisticated context routing?
How generalizable are these web-app-focused findings to other domains (scientific research, financial modeling)?


About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.


