Evaluation Harnesses Have an Expiration Date

Updated: 2026-02-13 By: Ari Heljakka

Short answer

An agent harness is the runtime loop that surrounds the model: the exit conditions, the iteration caps, the tool-call parsing, the retry budget, the error-recovery heuristics. Every one of those constants encodes an assumption about how the model behaves. The assumption is correct for the model the harness was designed against. As models evolve, the assumption becomes a quiet failure mode the harness cannot see. The 2026 lesson is that harnesses have an expiration date, that the expiration date is invisible without continuous multi-model evaluation, and that the discipline that keeps harnesses fresh is the same discipline that keeps evaluators fresh: versioning, calibration, and an evaluation suite that runs every harness against every supported model on every change.

Key facts

  • Definition: An agent harness is the runtime loop and control structure around the model: when the loop exits, how many iterations are allowed, how tool calls are parsed, how errors are recovered, how outputs are validated.
  • When to use: Whenever an agent is built on top of an LLM with a control loop, which covers most production agents. The harness is the silent third leg alongside the model and the prompt.
  • Limitations: The harness's assumptions about model behavior are usually undocumented and only surface as failures during a model swap. Without an evaluation suite that exercises the harness against multiple models, the failures are diagnosed as model regressions.
  • Example: A loop that exits when the model emits text-without-tool-calls works well for one model and silently passes another model's intent-to-act-without-action as a successful exit. The harness produces a confident, wrong answer.

Key takeaways

  • Harnesses are control logic, not configuration. They embed an empirical model of model behavior that ages.
  • Implicit exit conditions are the most common silent failure. A model that says "I will now do X" and then stops should not trigger a successful exit; in many harnesses it does.
  • Multi-model evaluation is the only way to find harness assumptions before they hit production. Single-model evaluation hides them by definition.
  • Treat harness tuning constants (iteration caps, exit conditions, retry budgets) as managed configuration in versioned files, not as inline constants.
  • The remediation is structural: version the harness, evaluate it against a benchmark suite on every change and every supported model, and surface harness-attributable failures as a distinct signal from model regressions.

Definition

An agent harness is the runtime structure that wraps a model in a loop, parses its outputs, dispatches its tool calls, and decides when the task is complete. Concrete components include:

  • The main loop that alternates between model invocations and tool execution.
  • The exit condition that tells the loop when the task is finished.
  • The iteration cap that bounds the loop's worst-case cost and latency.
  • The tool-call parser that extracts tool invocations from model output.
  • The retry budget and recovery heuristics for tool failures.
  • The output validator that decides whether the final result satisfies the request.

Every one of these is an empirical decision about how the model behaves under load. The decision is correct for the model the harness was designed against. The model changes; the decision does not. The harness has accumulated an implicit expiration date.

When this matters

The expiration-date problem becomes decisive when:

  • The team operates an agent that uses any non-trivial control loop (most production agents).
  • The team supports more than one underlying model (multi-vendor strategy, ensembles, fallback chains).
  • The model vendor updates the model the harness was tuned against. A point release can shift token formatting, tool-call structure, or refusal behavior.
  • The team is evaluating a model swap and is trying to attribute the regression to the model, the prompt, or the harness. The third candidate is often missed.
  • Tasks are long-running and iteration caps are close to the task's worst case. A model that takes one more step than expected exhausts the cap silently.

For one-shot, single-call use cases without a loop, the harness is trivial and the problem does not apply. For everything else, the problem is present whether the team has noticed or not.

How it works

The classic implicit-finish failure

A common harness pattern is: the loop runs the model, parses the response for tool calls, executes the tool calls, feeds the results back, and continues. The exit condition is implicit: if the model emits a response with no tool calls, the task is treated as complete.

This works for a model that combines narration and action in the same response (it explains and acts, or it explains and concludes). It fails without an obvious error for a model that narrates the next action in one response and only attempts the action in the next response. The harness sees the narration, finds no tool calls, and exits. The task was not completed; the transcript reads as if it had been.

The failure is silent on three axes. It is silent in the logs (the transcript looks coherent). It is silent in single-model evaluation (the model the harness was tuned against does not exhibit the pattern). It is silent in user-visible metrics (the loop exited successfully). The only thing that catches it is an evaluator that scores task completion against the user's original goal, paired with a benchmark that exercises multiple models.
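The pattern can be reproduced without a real model. The sketch below is illustrative only: `Response`, `run_implicit_finish`, and the scripted turns are hypothetical stand-ins, not part of any particular framework. It replays two scripted transcripts through a loop that exits on the first response without tool calls.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """One model turn: free text plus zero or more requested tool calls."""
    text: str
    tool_calls: list[str] = field(default_factory=list)


def run_implicit_finish(turns: list[Response], max_iterations: int = 10) -> dict:
    """Replay scripted turns through a loop that treats the first response
    without tool calls as task completion (the implicit-finish pattern)."""
    executed: list[str] = []
    steps = 0
    for response in turns[:max_iterations]:
        steps += 1
        if not response.tool_calls:
            # Implicit exit: no tool call is read as "task complete".
            return {"exit": "finished", "steps": steps, "executed": executed}
        executed.extend(response.tool_calls)  # stand-in for real tool dispatch
    # Ran out of scripted turns or hit the cap without a text-only response.
    return {"exit": "exhausted", "steps": steps, "executed": executed}


# Model A style (assumed): narrates and acts in the same turn, then concludes.
model_a = [
    Response("I will write the record now.", ["write_back"]),
    Response("Done. The record is saved."),
]
# Model B style (assumed): narrates in one turn, would only act in the next.
model_b = [
    Response("I will write the record now."),
    Response("", ["write_back"]),
    Response("Done. The record is saved."),
]

result_a = run_implicit_finish(model_a)
result_b = run_implicit_finish(model_b)
print(result_a)
print(result_b)
```

Both runs report a "finished" exit. Only the first actually executed `write_back`; the second exited on the narration turn with nothing executed, which is the false finish described above.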

The hidden tuning-constant problem

The implicit-finish pattern is the most cited example; it is not the only one. A non-exhaustive list of harness constants that age:

  • Iteration caps. Tuned against a model that completed tasks in 10 steps; the new model takes 14. The cap fires and the task fails for what looks like a quality reason.
  • Token-budget assumptions. Tuned against a model with a 4K reasoning budget; the new model, with a 16K budget, produces more output than the harness was built to accommodate.
  • Tool-call format expectations. A point release changes the JSON envelope around tool calls. The parser silently strips a field and the tool dispatcher sees a malformed call.
  • Retry policies. A model that previously failed transiently and recovered on retry now fails persistently on a different class of input. The retry budget masks the new failure mode as transient until the bill arrives.
  • Refusal handling. A model that previously refused with a structured response now embeds a polite refusal in free text. The harness keeps looping, looking for a tool call that will not come.
  • Context-window assumptions. A model with a larger window starts producing longer outputs that the validator (built for a shorter window) parses incorrectly.

Each constant was correct when it was written. None of them are visible in a code review because they look like ordinary configuration. They become failures only when the model on the other side of the harness changes shape.
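As a small illustration of the iteration-cap case, the sketch below simulates a cap tuned for 10-step tasks meeting a model that needs 14. The names (`ExitReason`, `MAX_ITERATIONS`, `run_task`) are hypothetical, and the "model" is reduced to a step count. The point is only that recording why the loop stopped lets a cap hit be told apart from a quality failure.

```python
from enum import Enum


class ExitReason(Enum):
    FINISHED = "finished"
    ITERATION_CAP = "iteration_cap"


MAX_ITERATIONS = 10  # hypothetical cap, tuned when tasks took about 10 steps


def run_task(steps_needed: int, cap: int = MAX_ITERATIONS) -> ExitReason:
    """Simulate a task whose model needs `steps_needed` loop iterations."""
    for step in range(1, cap + 1):
        if step == steps_needed:
            return ExitReason.FINISHED
    # The loop ran out of budget; record that instead of a generic failure.
    return ExitReason.ITERATION_CAP


old_model = run_task(steps_needed=10)  # the model the cap was tuned against
new_model = run_task(steps_needed=14)  # a model that takes four more steps
print(old_model.value, new_model.value)
```

Without the explicit exit reason, the second run would show up only as an incomplete task and would be easy to file as a model regression.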

The cure: continuous multi-model evaluation

The single technique that finds these failures is an evaluation suite that runs the harness against multiple supported models on a fixed benchmark of tasks, on every change to the harness, the prompt, the model, or the tool surface. The suite needs three properties:

  • Diverse model coverage. At least the models the team supports plus one or two adjacent models whose behavior differs in known ways. Single-model evaluation hides the assumptions by construction.
  • Diverse task coverage. Short tasks, long tasks, tasks that fail tools, tasks that produce ambiguous outputs. The harness assumptions surface at the edges of the task distribution, not the centre.
  • Failure-mode attribution. The suite has to distinguish harness failures (silent exits, retry exhaustion, parser drops) from model failures (wrong answer, refusal). Without that distinction, a harness regression is misdiagnosed as a model regression.

A harness scored on this suite produces a few visible metrics: false-finish rate, iteration-cap hit rate, parser-drop rate, plus the per-task correctness numbers. Each is an operational signal. A change in any of them is a diff worth investigating before it hits production.
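One way these metrics might be aggregated is sketched below. It assumes each benchmark run has already been labeled with per-run flags by the evaluator; the `RunRecord` fields and the model names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One benchmark run, with flags assumed to come from the evaluator."""
    model: str
    correct: bool
    false_finish: bool
    hit_iteration_cap: bool
    parser_dropped: bool


def harness_metrics(records: list[RunRecord]) -> dict[str, dict[str, float]]:
    """Aggregate per-model correctness and harness-attributable rates."""
    by_model: dict[str, list[RunRecord]] = defaultdict(list)
    for record in records:
        by_model[record.model].append(record)

    report: dict[str, dict[str, float]] = {}
    for model, runs in by_model.items():
        n = len(runs)
        report[model] = {
            "correctness": sum(r.correct for r in runs) / n,
            "false_finish_rate": sum(r.false_finish for r in runs) / n,
            "iteration_cap_hit_rate": sum(r.hit_iteration_cap for r in runs) / n,
            "parser_drop_rate": sum(r.parser_dropped for r in runs) / n,
        }
    return report


records = [
    RunRecord("model-a", True, False, False, False),
    RunRecord("model-a", True, False, False, False),
    RunRecord("model-b", True, False, False, False),
    RunRecord("model-b", False, True, False, False),
]
report = harness_metrics(records)
for model, metrics in report.items():
    print(model, metrics)
```

Keeping the harness-attributable rates next to correctness, per model, is what makes a change in any of them show up as a diff.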

Version the harness as a managed component

The remediation is structural. Treat the harness the same way the team treats evaluators:

  • Version the harness. Every change to the loop logic, the exit condition, or any tuning constant is a versioned change with a diff and an owner.
  • Externalize the tuning constants. Iteration caps, retry budgets, parser format expectations live in a versioned configuration file, not as inline magic numbers.
  • Tie the version to the model matrix. A harness version declares which models it has been validated against. Running it against a new model is an evaluation event, not an unsupervised swap.
  • Run the suite on every change. A change to the harness, the prompt, the model, or the tool surface triggers a re-evaluation against the supported model matrix.
  • Surface harness signals in monitoring. False-finish rate and iteration-cap hit rate are first-class metrics in the live system, not afterthoughts in a quarterly review.

This is the same discipline the team applies to any other piece of evaluation infrastructure. The harness is part of the system being evaluated; it deserves the same treatment.
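A minimal sketch of what externalized, versioned constants tied to a model matrix could look like follows. The file name, field names, version string, and values are all hypothetical; an inline JSON string stands in for the versioned file so the example is self-contained.

```python
import json
from dataclasses import dataclass

# Stand-in for the contents of a versioned file such as harness.json
# (the file name and every field name here are hypothetical).
RAW_CONFIG = """
{
  "version": "1.4.0",
  "exit_condition": "first_text_without_tool_call",
  "max_iterations": 10,
  "retry_budget": 3,
  "validated_models": ["model-a"]
}
"""


@dataclass(frozen=True)
class HarnessConfig:
    version: str
    exit_condition: str
    max_iterations: int
    retry_budget: int
    validated_models: frozenset


def load_config(raw: str) -> HarnessConfig:
    """Parse the versioned configuration into an immutable object."""
    data = json.loads(raw)
    data["validated_models"] = frozenset(data["validated_models"])
    return HarnessConfig(**data)


def is_validated(config: HarnessConfig, model_id: str) -> bool:
    """True only if this harness version has been evaluated on the model."""
    return model_id in config.validated_models


config = load_config(RAW_CONFIG)
for model_id in ("model-a", "model-b"):
    status = "validated" if is_validated(config, model_id) else "needs an evaluation run first"
    print(f"harness {config.version} on {model_id}: {status}")
```

With the constants in one reviewed file, a change to the cap or the exit condition appears as a diff with an owner, and pointing the harness at an unlisted model becomes a visible event.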

Example

A team operates an agent that supports two foundation models, A and B, with a tool surface for retrieval, computation, and a write-back action. The harness exits on first-text-without-tool-call.

Baseline (model A): Correctness 88 percent on a 100-task benchmark. False-finish rate 0 percent. Iteration-cap hit rate 2 percent.

Model swap (model B, same harness): Correctness 71 percent. The team's first hypothesis is a model regression and they begin re-prompting.

Harness evaluation: Running the same benchmark with an adaptive harness (a variant that nudges the model when narration-without-action is detected) raises model B's correctness to 84 percent. False-finish rate on model B with the original harness was 12 percent. The headline regression was 17 points; 13 of those were harness-attributable.
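The attribution arithmetic, spelled out with the figures above (percentage points on the 100-task benchmark). The variable names are illustrative. The residual is the remaining gap to model A's baseline; it is not a pure model effect, since model A's score under the adaptive harness is not reported here.

```python
# Correctness in percent, taken from the example in the text.
baseline_a = 88       # model A, original harness
swap_b_original = 71  # model B, original harness
swap_b_adaptive = 84  # model B, adaptive harness

headline_regression = baseline_a - swap_b_original        # what the team first saw
harness_attributable = swap_b_adaptive - swap_b_original  # recovered by changing only the harness
residual_gap = headline_regression - harness_attributable # still unexplained after the harness fix

print(f"headline regression:  {headline_regression} points")
print(f"harness-attributable: {harness_attributable} points")
print(f"residual gap:         {residual_gap} points")
```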

Diagnosis: Model B narrates next actions in a separate turn more often than model A does. The implicit-finish harness was tuned against model A's behavior and silently exited on model B's narration.

Remediation: Switch to the adaptive harness for both models. Iteration cap stays the same. The team adds the false-finish rate as a monitored metric and a deployment gate on the multi-model benchmark suite. The next model release the team adopts is gated on the same suite. Future surprises are detected before deployment, not in production.
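The adaptive harness in this example is described only as a variant that nudges the model when narration-without-action is detected. One possible shape for such a loop is sketched below, under stated assumptions: the nudge wording, the `max_nudges` limit, the prefix-based detector, and the scripted model are all hypothetical.

```python
from typing import Callable

NUDGE = ("You described an action but did not call a tool. "
         "Call the tool now, or state that the task is complete.")

Turn = tuple[str, list[str]]  # (text, requested tool calls)


def run_adaptive(model: Callable[[list[str]], Turn],
                 narrates_intent: Callable[[str], bool],
                 max_iterations: int = 10,
                 max_nudges: int = 1) -> dict:
    """Loop that nudges instead of exiting when a turn announces an action
    without requesting a tool call."""
    history: list[str] = []
    executed: list[str] = []
    nudges = 0
    for step in range(1, max_iterations + 1):
        text, tool_calls = model(history)
        history.append(text)
        if tool_calls:
            executed.extend(tool_calls)  # stand-in for real tool dispatch
            continue
        if narrates_intent(text) and nudges < max_nudges:
            nudges += 1
            history.append(NUDGE)  # ask the model to act or conclude
            continue
        return {"exit": "finished", "steps": step, "executed": executed, "nudges": nudges}
    return {"exit": "iteration_cap", "steps": max_iterations, "executed": executed, "nudges": nudges}


def scripted(turns: list[Turn]) -> Callable[[list[str]], Turn]:
    """Fake model that replays fixed turns and ignores the history."""
    remaining = iter(turns)
    return lambda history: next(remaining)


model_b = scripted([
    ("I will write the record now.", []),
    ("", ["write_back"]),
    ("Done. The record is saved.", []),
])
result = run_adaptive(model_b, lambda text: text.lower().startswith(("i will", "let me")))
print(result)
```

In this replay the narration turn triggers one nudge, the tool call runs on the next turn, and the loop exits only on the concluding turn. The nudge budget is bounded so that a model that keeps narrating cannot hold the loop open indefinitely.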

Limitations

  • Multi-model evaluation costs money. Running every harness change against every supported model on a real benchmark is not free. Sampling helps; skipping does not.
  • Some failure modes only surface in long tasks. A benchmark of short, simple tasks will not exercise iteration caps or retry-budget exhaustion. Include long-task representatives.
  • The harness can be entangled with the framework. When the framework owns the loop, replacing the harness is not a local change. Framework-coupled designs trade harness flexibility for day-one velocity; the trade is real.
  • Some constants are critical to cost. Iteration caps protect against runaway behavior. Loosening them to accommodate a slower model can mask a different problem. Treat changes as deliberate.
  • Vendor model documentation lags behavior. A model update that changes tool-call envelope or refusal style may not be documented in the release notes. The benchmark suite is the protection.
  • An adaptive harness is not a free lunch. It catches the failure modes it was designed for; new modes still require evaluation to surface.

FAQ

How often do harnesses really break? Every meaningful model update is a candidate event. In practice, the breakage is rare but high-impact: when it happens, it is silent and attributed to the wrong cause.

Should I always use an adaptive harness? Adaptive harnesses cost slightly more in tokens and complexity. They catch the most common silent-exit failure mode. For agents that support multiple models or are likely to swap models, the trade is usually worth it.

What is the simplest first step? Add a false-finish detector to your harness benchmark and report the rate per model. That single metric surfaces the most common silent failure cheaply.
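A rough sketch of such a detector is shown below. It is a phrase heuristic, not a definitive method: the pattern list and function names are hypothetical, and a final turn such as "I will be happy to help further" would be flagged incorrectly, so an evaluator that scores completion against the original goal remains the stronger check.

```python
import re

# Hypothetical phrases that announce an action without performing it.
INTENT_PATTERN = re.compile(
    r"\b(i will|i'll|i am going to|i'm going to|let me|next,? i)\b",
    re.IGNORECASE,
)


def looks_like_false_finish(final_text: str, tool_calls: list[str]) -> bool:
    """Flag a final turn that announces an action but requests no tool call."""
    return not tool_calls and bool(INTENT_PATTERN.search(final_text))


def false_finish_rate(final_turns: list[tuple[str, list[str]]]) -> float:
    """Share of final turns flagged by the heuristic."""
    flagged = sum(looks_like_false_finish(text, calls) for text, calls in final_turns)
    return flagged / len(final_turns)


# Final turns per model from a (hypothetical) benchmark run.
final_turns_by_model = {
    "model-a": [("Done. The record is saved.", []), ("The total is 42.", [])],
    "model-b": [("I will now write the record.", []), ("Done. The record is saved.", [])],
}
rates = {model: false_finish_rate(turns) for model, turns in final_turns_by_model.items()}
for model, rate in rates.items():
    print(f"{model}: false-finish rate {rate:.0%}")
```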

Does this matter for a one-model deployment? Less, but not zero. Even a one-model deployment faces vendor point releases. A benchmark suite that runs the harness on every model update catches the same class of failure.

What is the relationship to prompt engineering? The harness, the prompt, and the model form a triple. A prompt change can mask a harness regression; a harness change can mask a prompt regression. Versioning all three and evaluating their combination is the only way to attribute failures correctly.

Can the harness be inferred from logs? Partly. A failure trace, together with the harness version and the model identifier, is enough to reproduce the failure. Without versioning, the same trace cannot be re-run with confidence.
