Where Does Agent Harness Work Remain Durable as Models Improve?

Short answer

Durable harness work remains at the boundaries between model behavior and external reality: repo-local source of truth, mechanical verification, context-window budgeting, isolation/security boundaries, tool contracts, and human decision surfaces. What shrinks is prompt scaffolding that teaches the model behaviors the next model can perform natively: aggressive to-do-list prompting, explicit loop invocation, brittle state-machine orchestration, and advisory reminders.

The rule: capability scaffolding shrinks; boundary enforcement persists. A better model can learn to choose a to-do list, start a loop, or follow a workflow without being nagged. It still needs a current repo to read, tests to pass, tools with permissions, a context budget, and a place to write durable state.

What does not stay durable

Harness Shrinkage as Models Improve gives the negative space. Early Claude Code needed aggressive prompting to use a to-do list; later models used it spontaneously, so the prompt crutch was deemphasized while the tool stayed for user-facing visibility. Boris Cherny's "100 lines of code a year from now" claim is hyperbolic, but the direction is supported: prompt sections, fallback logic, and safety reminders get deleted when the model handles the behavior natively.

The same pattern appears in loops. A /loop primitive can start as harness functionality, then migrate into model behavior when Opus 4.7 notices changing data and offers to report periodically without being explicitly told. The durable part is not "teach the model to run a loop"; it is the substrate that makes a loop useful and bounded: workspace, schedule, output location, permissions, and verification.

So the disposable harness layer is:

Capability injection: prompts that teach generic behaviors the next model may internalize.
Over-specified workflows: rigid state transitions where "objective + tools" is enough for a stronger agent.
Always-loaded explanation: giant CLAUDE.md / AGENTS.md content that burns the Context Window Smart Zone before work starts.
Advisory rules pretending to be enforcement: reminders that should be hooks, tests, linters, schemas, or permission boundaries.

Durable layer 1: repo-local source of truth

Agent Harness Engineering and Code as Source of Truth converge on the same invariant: the agent can only reliably use what is in the repository or filesystem it can read. Slack threads, stale Google Docs, and undocumented architectural memory are not harness; they are missing input.

OpenAI's Codex harness treats AGENTS.md as a table of contents, not an encyclopedia. Fiona Fung's AI-native-org rule is stricter: when code changes fast, external docs decay, so specs and skills must be checked into the codebase. That is durable because model improvement increases coding throughput, which makes out-of-repo documentation rot faster, not slower.

This work remains valuable even if the model gets much smarter:

Keep project context small, versioned, and local.
Store specs, skills, workflow rules, and architectural decisions where diffs can update them.
Make the codebase teachable by Claude, rather than relying on colleagues to replay tacit context.
Prefer progressive disclosure: top-level context points to deeper files; nested context is pulled when relevant.
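
One way to keep progressive disclosure honest is to check mechanically that the top-level context file only points at files that actually exist. The sketch below is illustrative and not taken from any of the cited harnesses: it assumes the index uses Markdown-style links, and the function name `dangling_links`, the sample index text, and the sample paths are all hypothetical.

```python
import re

# Matches Markdown-style links such as [label](path/to/file.md).
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def dangling_links(index_text: str, existing_paths: set[str]) -> list[str]:
    """Return local link targets in the index that are not in the repo."""
    targets = LINK.findall(index_text)
    # Ignore external URLs and in-page anchors; only repo-local paths matter.
    local = [t for t in targets if not t.startswith(("http://", "https://", "#"))]
    return [t for t in local if t not in existing_paths]


# Hypothetical top-level context file and repo contents.
index_text = (
    "See [architecture](docs/architecture.md) and [testing](docs/testing.md), "
    "plus the [vendor site](https://example.com)."
)
existing_paths = {"docs/architecture.md"}

missing = dangling_links(index_text, existing_paths)
print(missing)
```

Run as a pre-commit or CI step, a non-empty result could fail the build, so the table of contents cannot silently drift away from the files it is supposed to disclose.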

A smarter model can infer more from code, but it cannot infer a current decision that was never recorded.

Durable layer 2: mechanical verification

The strongest durable claim is verification. Harness Shrinkage as Models Improve explicitly separates shrinking prompt scaffolding from load-bearing mechanical verification. Claude Code Best Practices makes this operational: give Claude tests, screenshots, expected outputs, linter commands, and error messages, or the human becomes the only feedback loop.

Agent Harness Engineering names the general principle: enforce invariants, not implementations. Durable harness work defines the boundaries the model must respect and lets implementation vary inside them. Examples from the covered pages:

JSON feature lists where agents may flip "passes": false to "passes": true only after verification.
Structural linters enforcing dependency direction, naming, logging, and file-size limits.
Hooks for deterministic actions that CLAUDE.md can only request.
Spec-drift checks against committed specs.
Fresh-context reviewer sessions instead of asking a tired implementer session to review itself.
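
The feature-list item above can be sketched as a small guard. This is a minimal illustration under stated assumptions: the field names `name` and `passes`, the function `mark_passed`, and the `verify` callback (a stand-in for a real test run) are hypothetical, not the format used by any cited harness.

```python
import json
from typing import Callable


def mark_passed(features: list[dict], name: str, verify: Callable[[], bool]) -> bool:
    """Flip `passes` from False to True only when `verify()` succeeds."""
    for feature in features:
        if feature["name"] == name:
            if feature["passes"]:
                return True  # already verified; this guard never flips it back
            if verify():
                feature["passes"] = True
            return feature["passes"]
    raise KeyError(f"unknown feature: {name}")


# Hypothetical feature list, as it might be stored in the repo.
features = json.loads(
    '[{"name": "login", "passes": false}, {"name": "export", "passes": false}]'
)

mark_passed(features, "login", verify=lambda: True)    # verification succeeded
mark_passed(features, "export", verify=lambda: False)  # verification failed
print(json.dumps(features, indent=2))
```

The point of the design is that the agent never writes the flag directly; the only path to true runs through the verifier, so the invariant holds regardless of how the feature was implemented.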

This is not a crutch for weak models. It is the contract with reality. Better models reduce how often verification catches mistakes; they do not make verification obsolete.

Durable layer 3: context-window economics

The Context Window Smart Zone is a durable constraint until the underlying attention/memory architecture changes. The wiki's evidence is not "long context is useless"; it is sharper: long context can retrieve, but reasoning degrades as sessions cross the smart-zone threshold. That makes context budgeting harness work, not prompt decoration.

Durable context work includes:

Keep always-loaded files short.
Use skills / on-demand docs instead of pushing every rule into the system prompt.
Clear the context between unrelated tasks when state can be resumed from written records.
Use subagents for broad investigation so only summaries return.
Split work into vertical slices and loops so each session starts fresh.
Review in a clean context.
Surface token usage with status lines or equivalent accounting.
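
Keeping always-loaded files short is easier to sustain when it is measured. The following is a rough sketch, assuming a crude heuristic of about four characters per token (real tokenizers differ) and a budget chosen by the team; `estimate_tokens`, `over_budget`, and the sample file sizes are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): roughly 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def over_budget(files: dict[str, str], budget_tokens: int) -> list[str]:
    """Return file names, largest first, if the combined estimate exceeds the budget.

    An empty list means the always-loaded context fits within the budget.
    """
    total = sum(estimate_tokens(text) for text in files.values())
    if total <= budget_tokens:
        return []
    return sorted(files, key=lambda name: estimate_tokens(files[name]), reverse=True)


# Hypothetical always-loaded context files (contents stubbed with filler text).
always_loaded = {
    "AGENTS.md": "x" * 400,    # about 100 estimated tokens
    "CLAUDE.md": "x" * 4000,   # about 1000 estimated tokens
}

offenders = over_budget(always_loaded, budget_tokens=200)
print(offenders)
```

The largest-first ordering is a design choice: it points the pruning effort at the file most likely to be carrying explanation that belongs in an on-demand doc instead.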

Model improvements may make the dumb zone less dumb, but they also increase output volume and agent autonomy. The need to bound what any one session must reason over remains.

Durable layer 4: isolation, permissions, and deployment boundaries

The durable security boundary is not "the model promises not to do something." It is the environment that decides what can happen. Hermes Agent makes this explicit: under container backends, dangerous-command checks are skipped because the container is treated as the security boundary. That shifts load from per-command approval prompts to image discipline and sandbox quality.

Across Agent Harness Engineering, Claude Code Best Practices, and Hermes Agent, the stable work is:

Per-user or per-issue workspace isolation.
Container, VM, or OS-level sandboxing.
Allowlist / pairing authorization for multi-user deployment.
Scoped tools and credentials.
Permission modes and classifiers where they enforce pre-execution boundaries.
Daemon lifecycle management for always-on agents.
Restart recovery through tracker + filesystem state.
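
Per-issue workspace isolation can be partly illustrated with a path-containment check that runs before a file tool touches anything. This is a sketch under stated assumptions, not a replacement for a container or VM: the check is purely lexical (it does not resolve symlinks), it assumes POSIX-style absolute workspace paths, and `resolve_in_workspace` and the `/work/issue-42` layout are hypothetical.

```python
import posixpath


def resolve_in_workspace(workspace: str, requested: str) -> str:
    """Return the normalized path if it stays inside the workspace, else raise."""
    root = posixpath.normpath(workspace)
    if not posixpath.isabs(root):
        raise ValueError("workspace must be an absolute path")
    # join() discards the root when `requested` is absolute, so absolute
    # escapes such as /etc/passwd are caught by the containment check below.
    candidate = posixpath.normpath(posixpath.join(root, requested))
    if posixpath.commonpath([root, candidate]) != root:
        raise PermissionError(f"{requested!r} escapes workspace {root!r}")
    return candidate


# Hypothetical per-issue workspace.
print(resolve_in_workspace("/work/issue-42", "src/app.py"))

for attempt in ("../issue-43/secret.txt", "/etc/passwd"):
    try:
        resolve_in_workspace("/work/issue-42", attempt)
    except PermissionError as err:
        print("blocked:", err)
```

A check like this narrows what a tool will attempt; the blast-radius boundary itself still comes from the sandbox the process runs in.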

As models improve, advisory permission text should shrink. The need for a blast-radius boundary does not.

Durable layer 5: tool contracts and orchestration substrate

Tool-dispatch prompting may shrink; tool contracts remain. A model can get better at choosing MCP, shell, browser, or a subagent, but those actions still need APIs, credentials, schemas, sandboxes, and result surfaces.

The durable orchestration work is therefore substrate, not choreography:

Stable non-interactive entry points (claude -p, Hermes CLI, Codex App Server-style protocols).
Daemon-first runtimes for scheduled or team-scale work.
Per-task or per-user tenancy.
Filesystem-backed state when no orchestrator DB is necessary.
Cron or issue-tracker dispatch with complete context for fresh sessions.
Credential-wrapping tools so subagents can act without seeing secrets.
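
The credential-wrapping item can be sketched as a closure that holds the secret and exposes only a narrow tool to the subagent. Everything named here is an assumption for illustration: `make_wrapped_tool`, the bearer-token header, the `fake_transport` stand-in for a real HTTP client, and the sample token are hypothetical.

```python
from typing import Callable

Transport = Callable[[str, dict[str, str]], str]


def make_wrapped_tool(secret: str, call: Transport) -> Callable[[str], str]:
    """Build a tool that attaches the credential itself.

    The caller passes only a path; it never receives or supplies the secret.
    """
    def tool(path: str) -> str:
        headers = {"Authorization": f"Bearer {secret}"}
        result = call(path, headers)
        # Defensive scrub in case the backend echoes the credential back.
        return result.replace(secret, "[REDACTED]")

    return tool


def fake_transport(path: str, headers: dict[str, str]) -> str:
    """Stand-in for a real HTTP client; echoes the header to show the scrub."""
    return f"GET {path} ok, auth={headers['Authorization']}"


# The harness builds the tool; the subagent only ever sees `list_issues`.
list_issues = make_wrapped_tool("s3cr3t-token", fake_transport)
output = list_issues("/issues")
print(output)
```

The subagent can act through `list_issues` but has no argument or return value through which the token itself passes, which is the property the list item describes.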

Symphony's shift from rigid state-machine nodes to "objectives + tools" is the clean example in Agent Harness Engineering. The brittle state machine shrinks. The tracker, workspace lifecycle, retry/backoff, stall detection, and reconciliation remain.
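
The substrate pieces that remain are small and mechanical. As a rough sketch of two of them, retry/backoff and stall detection, under assumed parameters (this is not Symphony's implementation, and `backoff_delays`, `is_stalled`, and all the numbers are hypothetical):

```python
def backoff_delays(base: float, factor: float, cap: float, attempts: int) -> list[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def is_stalled(last_progress_at: float, now: float, stall_after: float) -> bool:
    """A task counts as stalled once it has gone `stall_after` seconds
    without recording progress (for example, a commit or tracker update)."""
    return now - last_progress_at > stall_after


delays = backoff_delays(base=1.0, factor=2.0, cap=30.0, attempts=6)
print(delays)

# Hypothetical timestamps in seconds: last progress at t=100, checked at t=401.
print(is_stalled(last_progress_at=100.0, now=401.0, stall_after=300.0))
```

Neither function tells the agent how to do its work; both only bound how long the system waits before retrying or reconciling, which is why they survive the move to "objectives + tools".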

Durable layer 6: human decision surfaces

The human role in Agent Harness Engineering is not typing more code. It is prioritizing work, translating feedback into acceptance criteria, designing feedback loops, validating outcomes, and diagnosing missing tools or guardrails. Better models do not delete that boundary; they push more work through it.

This is where durable harness overlaps with Code as Source of Truth: the human's decisions must become repo-local artifacts, not vibes in a chat. Otherwise every fresh session re-derives architecture and compounds agentic debt.

Durable human-facing work:

Acceptance criteria the agent can test.
Specs checked into the repo.
Backlogs sliced so each task is independently verifiable.
Review surfaces that show diffs, screenshots, failing tests, and spec drift.
Pruning discipline: re-read prompts and context files at each model launch and delete what no longer earns its tokens.

The human should not preserve old harness out of nostalgia. The human should preserve the boundary artifacts that make stronger models safe to use at higher throughput.

Practical test

For any harness component, ask one question:

If the next model got much smarter, would this become unnecessary, or would it become more important because the agent can now do more damage faster?

Likely to shrink:

Prompt reminders for generic planning behavior.
Forced tool-use nudges.
Rigid step-by-step orchestration.
Repeated safety prose.
Large always-loaded explanations of things visible in code.

Likely durable:

Tests, linters, schemas, hooks, and spec-drift checks.
Repo-versioned context files and skills.
Workspace isolation and credential boundaries.
Token/context accounting.
Subagent and fresh-review patterns.
Tool protocols and daemon lifecycle.
Acceptance criteria and human review surfaces.

Bottom line

Durable agent harness work is the work that turns model capability into bounded, inspectable, repeatable effects in the world. Prompt scaffolding is a wasting asset; every model launch should put it on trial. Boundary design is not a wasting asset. The stronger the model, the more important it becomes that state, verification, tools, permissions, and human decisions live in structures the model can read and the system can enforce.

Harness Shrinkage as Models Improve — model-facing scaffolding shrinks; mechanical verification stays load-bearing.
Agent Harness Engineering — repo-local artifacts, progressive disclosure, invariant enforcement, and harness-as-service.
Claude Code Best Practices — practical workflow rules for verification, context, subagents, hooks, and scaling.
Hermes Agent — daemon-first, per-user, bounded-memory, container-boundary variant of the same harness pattern.
Context Window Smart Zone — why context budgeting remains a harness responsibility.
Code as Source of Truth — why durable context must live in the repo as coding throughput rises.