What Is an Agent Harness?

What Is an Agent Harness?

Updated: 2026-04-16 By: Ari Heljakka

Short answer

An agent harness is the runtime that turns a language model into an autonomous agent. It is not a framework you assemble; it is a working system that ships with an iteration loop, context management, tool dispatch, subagent coordination, session persistence, lifecycle hooks, and a permission and safety layer. The harness is what holds the model accountable to a task across many steps. A harness is distinct from an evaluation harness: the agent harness wraps the runtime with the machinery it needs to act; the evaluation harness wraps the evaluators with the machinery they need to score. The two systems compose: the evaluation harness scores the trajectories the agent harness produces.

Key facts

  • Definition: An agent harness is the runtime infrastructure surrounding an LLM that implements the iteration loop, context management, tool execution, subagent coordination, persistence, and safety controls required to operate the model as an autonomous agent.
  • When to use: Any system where an LLM must act over multiple steps, call tools, manage state across turns, recover from failures, and operate under permission constraints.
  • Limitations: A harness is opinionated; the wrong harness for a workload is harder to escape than picking the wrong individual component; harness behaviour shapes evaluation outcomes and must itself be versioned and observed.
  • Example: A coding agent runs an outer iteration loop until the task is complete, calls bash and file-edit tools through a permission gate, compresses context above a token threshold, persists session state for recovery, and dispatches subagents for parallel sub-tasks.

Key takeaways

  • A harness ships as a working agent and a framework asks you to assemble one, which matters because harnesses bias toward defaults while frameworks bias toward configuration.
  • The components of a modern harness are the outer iteration loop, context management and compression, tool and skill dispatch, subagent management, session persistence, system-prompt assembly, lifecycle hooks, and the permission and safety layer.
  • Tools are universal primitives; skills are organisational knowledge encoded for specific workflows. The harness manages both but treats them differently.
  • The harness is itself a versioned, evaluable component. Its behaviour shapes every trajectory the agent produces and therefore every score the evaluation harness records.
  • An agent harness without an evaluation harness produces trajectories no one can grade; an evaluation harness without an agent harness has nothing to score. The two compose into the complete loop.

Definition

An agent harness is the runtime infrastructure that wraps a language model with the machinery required to act as an autonomous agent. The harness owns the control flow that the model itself cannot: the iteration loop that decides when the task is done, the context window that decides what the model sees, the tool dispatch that turns model outputs into external actions, the persistence layer that lets a session survive a restart, the permission gate that decides which actions are allowed.

A harness ships as a working agent. You point it at a task; it works the task. Contrast with a framework, which is a library of primitives you assemble into an agent. Both have their place; harnesses bias toward defaults and out-of-the-box correctness on common tasks, frameworks bias toward configuration for novel topologies.

The agent harness is distinct from the evaluation harness. The agent harness is the runtime that produces trajectories; the evaluation harness is the orchestration layer that scores trajectories. They share an interface (the trajectory data model) but solve different problems: the agent harness is concerned with the agent's behaviour, the evaluation harness is concerned with the agent's quality.

When this matters

The agent harness becomes the critical component of the system when:

  • Multi-step autonomy. The model must decide when to stop, not just respond to one prompt.
  • Tool use under constraints. Tool calls must be dispatched through permission and safety layers, with structured request and response capture.
  • Long sessions. Context window pressure forces compression, summarisation, or selective retention; the model alone cannot manage this.
  • Recovery from failure. Sessions must survive restarts, crashes, or interruptions; persistence is the harness's job.
  • Subagent coordination. Complex tasks decompose into sub-tasks dispatched to specialised subagents with their own tool sets and context budgets.
  • Lifecycle observability. Every step must emit telemetry that lands in the evaluation harness; the agent harness is where the hook points live.

For single-prompt LLM features, a harness is overkill; a thin SDK wrapper is enough. Past any of the conditions above, the harness is the layer where most of the engineering happens.

How it works

A working agent harness has eight components. Each is a property the harness must own; the implementation differs across harnesses, but the responsibility is constant.

Outer iteration loop

The loop that drives the agent step by step until the task is complete. Each iteration the model proposes an action (often a tool call), the harness executes it, the harness adds the result to the model's context, and the loop continues. The loop owns the stopping criterion, which can be a model-emitted "done" signal, a step-count cap, a budget cap, or an error condition. Without the outer loop, the agent is just a chatbot with tool-calling enabled.

Context management and context compression

The harness decides what the model sees on each iteration. Naive accumulation overflows the context window; the harness compresses earlier turns, summarises tool outputs, drops or retains state based on relevance. Compression typically activates above a token threshold; the policy is part of the harness's behaviour. Different harnesses make different compression choices; the choice shapes goal completion and context retention scores.

Tool and skill dispatch

Tools are programmatic, non-semantic, and generally deterministic primitives (read file, edit file, run shell command, query database, call an API): given the same arguments they do the same thing, and they carry no judgement of their own, the model supplies the intent and the tool just executes. Skills are organisational knowledge encoded for specific workflows (a runbook for deploying a service, a coding convention enforcement, a domain-specific procedure). The harness dispatches both: tools through a low-level dispatcher with structured request and response, skills through a higher-level invocation that may itself dispatch tools. Both are observable as spans in the trajectory.

Subagent management

Complex tasks decompose into sub-tasks the harness can dispatch to subagents. Subagents typically run asynchronously, in a restricted tool set, with their own context budget. They cannot usually spawn further subagents (preventing recursive blow-up). The harness manages the parent-child relationship, the input-output contract, and the trajectory shape so subagent work is observable upstream.

Session persistence and recovery

A session that survives a restart is the difference between an agent and a toy. The harness persists state (conversation history, tool results, intermediate plans) to durable storage and can resume from the persisted state after a crash or restart. The persistence model varies (file-based, database-backed, externalised to a control plane), but the responsibility is constant.

System-prompt assembly and project context injection

The harness assembles the system prompt at the start of each session, often combining a base system prompt, project-specific context files, runtime environment information (working directory, OS, available tools), and user preferences. The assembled system prompt is part of the trajectory's provenance and must be versioned.

Lifecycle hooks

Hooks fire at well-known points in the agent's lifecycle: session start, before tool call, after tool call, before model call, after model call, session end. Hooks are the integration point with telemetry, with permission checks, with cost accounting, with the evaluation harness's online sampling. Without lifecycle hooks, the agent is opaque to the rest of the stack.

Permission and safety layer

A permission tier (read-only, workspace-write, full access) gates each tool call. The user grants permissions explicitly; the harness enforces them at dispatch. The safety layer also covers content filtering on model outputs, rate-limiting on expensive tools, and audit logging of every action. The permission layer is the difference between an agent that helps and an agent that causes incidents.

Example

A coding agent for a software team. The harness in operation:

  • Iteration loop. The user asks the agent to fix a failing test. The loop runs: model proposes "read the failing test file" → tool dispatch reads the file → result added to context → model proposes "read the source file under test" → ... → model proposes "edit the source file with this patch" → permission check passes (workspace-write granted) → file edited → model proposes "rerun the test" → bash dispatch runs the test → result shows pass → model emits done. Loop exits.
  • Context compression. Around the eighth iteration, accumulated tool outputs cross the compression threshold; the harness summarises the early read-only tool outputs into a compact note, keeping the recent edits and test outputs in full.
  • Tool and skill dispatch. Tools (read, edit, bash) run through the low-level dispatcher. A skill ("follow the team's commit message convention") fires during the model's commit-message proposal step, injecting the convention as a constraint.
  • Subagent. The user later asks for a parallel refactor across ten files. The harness dispatches three subagents, each handling a slice of the files, with their own context budgets and no permission to spawn further subagents.
  • Persistence. The session writes a checkpoint after each tool call. A mid-session crash is recoverable; the agent resumes from the last checkpoint.
  • Lifecycle hooks. The pre-tool-call hook emits a trace span with the structured request; the post-tool-call hook emits the structured response. The evaluation harness samples those spans for online scoring against per-dimension rubrics (tool-use correctness, goal completion).
  • Permission layer. The agent is granted workspace-write but not full access. Any tool call that would touch outside the workspace is blocked and surfaced to the user.

The same trajectory that the agent harness produces becomes the input to the evaluation harness's nightly run and to the CI gate that scores the regression set.

Limitations

  • Harnesses are opinionated. The choices the harness makes about context compression, tool dispatch, and stopping criteria shape every trajectory. The wrong harness for your workload is hard to escape; component-level swaps inside a harness only go so far.
  • Harness behaviour shapes evaluation outcomes. A harness that aggressively compresses context will produce lower context-retention scores than one that does not, on identical models and prompts. The harness version must be recorded alongside the model version when interpreting scores.
  • Subagent attribution is tricky. Failures in subagents must be attributed correctly in the trajectory; bad attribution hides upstream causes inside downstream symptoms.
  • Persistence introduces state that evaluation must reckon with. Replay-from-state is not always equivalent to original execution; the evaluation harness must handle resumed sessions distinctly from fresh ones.
  • Permission layer interacts with evaluation. An agent constrained to read-only cannot complete a task that requires writes; its goal-completion score in that configuration is artificially low. Evaluation must run under the same permission profile as production.
  • Frameworks-vs-harnesses is a continuum. Many tools sit between the two; the practical question is whether the system works on common tasks without configuration, not whether it carries the harness label.

Evidence and sources

Numeric figures in this post (token thresholds, file limits, step caps) are illustrative; harness defaults vary widely and should be checked against the specific harness in use.

FAQ

How is an agent harness different from an evaluation harness? The agent harness wraps the runtime: iteration loop, context, tools, persistence, safety. The evaluation harness wraps the evaluators: inputs, scoring functions, actions. They share an interface (the trajectory data model) but solve different problems.

How is an agent harness different from an agent framework? A harness ships as a working agent; you point it at a task and it works. A framework is a library of primitives you assemble. The distinction is practical, not categorical: harnesses bias toward defaults, frameworks bias toward configuration.

What is the difference between a tool and a skill? Tools are universal primitives (read, edit, bash). Skills are organisational knowledge encoded for specific workflows (a deployment runbook, a coding convention, a domain procedure). The harness dispatches both, often through different surfaces.

Can I swap harnesses without rewriting the agent? Partially. Tools usually port; skills usually port; permission profiles usually port. The compression policy, the iteration loop's stopping criteria, and the subagent topology often do not port cleanly. Plan for measurable behaviour differences after a swap.

Do all harnesses support subagents? No. Some harnesses are single-agent only by design. Subagent coordination adds complexity (attribution, budget management, no-recursive-spawn invariants) that not every harness ships.

How do agent harness and evaluation harness compose? The agent harness emits trajectories through lifecycle hooks; the evaluation harness consumes trajectories as native inputs. Online sampling, CI regression runs, and incident-driven ad-hoc scoring all run against the same trajectory shape the agent harness produces.

Related reading