How do you evaluate CLI-based coding agents?

Updated: 2026-02-08 By: Ari Heljakka

Short answer

CLI-based coding agents change code faster than humans can review it. The reliable pattern is to wrap the agent in an evaluation harness: a structured way to trace every change the agent proposes, score the proposal on multiple dimensions, and accept, reject, or gate it based on the scores. The harness has three orthogonal components (inputs, evaluators, actions), and the same agent improves faster once it can read the harness's feedback. Without a harness, a CLI coding agent is a fast guesser. With one, it is a measurable contributor whose outputs can be trusted to land in production code.

Key facts

  • Definition: A CLI coding-agent evaluation harness is the structured wrapper that traces the agent's changes, scores them against measurable criteria, and turns the scores into operational actions (accept, gate, reject, alert, escalate).
  • When to use: Any team using a CLI coding agent on a non-trivial codebase, especially where the agent's changes may land in production or affect downstream consumers.
  • Limitations: Requires a curated regression set for the codebase; LLM-as-judge evaluators must be calibrated against human review; the harness adds latency and cost per agent action.
  • Example: A team's CLI coding agent opens 14 PRs in a week; the harness auto-merges 11 with per-dimension scores attached, surfaces 2 for human review, and rejects 1 for a regression on a versioned eval suite; the team's review time drops 70 percent.

Key takeaways

  • A CLI coding agent without an evaluation harness is a confident generator. With a harness, it is a measurable contributor.
  • The harness has three orthogonal components: inputs (what is evaluated), evaluators (how it is scored), actions (what the scores trigger).
  • Evaluation dimensions for coding agents are concrete: code-intent alignment (does the diff do what it claims), test pass rate, regression delta, latency, safety, and policy adherence.
  • The harness is a feedback loop the agent itself can consume. Surfaced scores make subsequent agent passes smarter, not just safer.
  • The harness is reusable across CLI agent families. The contract is the trace schema and the evaluator interface, not the agent vendor.

Definition

A CLI-based coding agent is an agent that reads, writes, and edits code through a terminal interface, operating against a local repository or a connected development environment. The category covers a growing set of tools from several vendors, all built around the same workflow: the agent reads files, proposes edits, runs commands (tests, builds, linters), and iterates.

An evaluation harness is the structured wrapper around the agent that records every action, scores every proposed change, and either accepts the change, rejects it, or surfaces it for review. The harness is not the agent; it is the measurement infrastructure the agent runs inside.

The harness exists because CLI coding agents generate changes faster than teams can review them. Without measurement, the team is trading review speed for production risk. With measurement, the trade-off becomes explicit and bounded.

When this matters

The harness earns its keep when at least two of these hold:

  • The CLI agent is touching production code, not a personal scratchpad.
  • The agent operates over a codebase large enough that a human cannot read every change.
  • The team ships multiple agent-driven PRs per week.
  • The cost of an undetected regression (build break, test failure, performance drop, security issue) exceeds the cost of the harness.
  • Multiple agents are in use (one team prefers one CLI; another prefers another) and the organization wants comparable quality signals across them.

For one-off scripts or personal automation, the harness is overhead. For team-wide adoption of CLI coding agents, it is essential.

How it works

The harness has three structural components and one operational loop. Each component is described as a property to aim for, not a tool to install.

Component 1, the inputs

What the evaluator consumes. For a coding agent, inputs are:

  • Traces. The full sequence of agent actions: files read, files written, commands run, model calls made.
  • Diffs. The proposed code change as a structured payload, not as free text.
  • Test results. Pass and fail counts before and after the change, plus the names of newly-passing and newly-failing tests.
  • Build and lint results. Pre- and post-change snapshots.
  • Datasets. Curated regression suites and held-out test sets the harness runs against.

The inputs are structured. An input layer that consumes free text discards structure the agent already had, and the evaluators then have to reconstruct it.
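The input fields listed above can be sketched as a small schema. This is a minimal illustration, not a standard format: the class names `SuiteResult` and `AgentTrace`, every field name, and the sample values are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SuiteResult:
    """Snapshot of a test run: names of passing and failing tests."""
    passed: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)


@dataclass
class AgentTrace:
    """One agent action as structured fields (hypothetical schema)."""
    agent: str
    files_read: list[str]
    files_written: list[str]
    commands_run: list[str]
    model_calls: int
    diff: str               # the proposed change, kept as its own field
    commit_message: str
    tests_before: SuiteResult
    tests_after: SuiteResult

    def to_json(self) -> str:
        # asdict() recurses into the nested SuiteResult dataclasses.
        return json.dumps(asdict(self), sort_keys=True)


# Sample values are invented for illustration.
trace = AgentTrace(
    agent="example-cli-agent",
    files_read=["src/parser.py"],
    files_written=["src/parser.py"],
    commands_run=["pytest -q"],
    model_calls=3,
    diff="--- a/src/parser.py\n+++ b/src/parser.py\n",
    commit_message="Handle empty input in parser",
    tests_before=SuiteResult(passed=["test_basic"], failed=["test_empty"]),
    tests_after=SuiteResult(passed=["test_basic", "test_empty"], failed=[]),
)
payload = json.loads(trace.to_json())
```

Keeping the diff, commit message, and test snapshots as separate fields lets each evaluator read exactly the part it scores instead of parsing a log.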

Component 2, the evaluators

How the inputs are scored. For a coding agent, evaluator types and the dimensions they score:

  • Deterministic checks. Test pass rate, lint clean, build green, type-check green. Each scores 0 or 1 (or normalized 0 to 1 for fractional pass rates).
  • LLM-as-judge evaluators. Code-quality dimensions (readability, idiomatic style, maintainability), code-intent alignment (does the diff do what the commit message claims), and refusal correctness (did the agent decline tasks outside its scope when it should have).
  • Embedding-similarity checks. Does the diff stay within the part of the codebase the agent was scoped to touch.
  • Custom scoring functions. Performance benchmarks, memory benchmarks, security scans, regulatory checks specific to the codebase.

Each evaluator returns a score from 0 to 1. Composition is explicit: every dimension keeps its own score and its own floor.
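One way to picture the evaluator interface is a plain function from inputs to a score in the 0-to-1 range, with floors checked per dimension. The sketch below assumes hypothetical input keys (`tests_passed`, `tests_failed`, `build_ok`) and hypothetical helper names; only the two deterministic checks are shown, and an LLM-as-judge evaluator would plug into the same signature.

```python
from typing import Any, Callable, Mapping

# An evaluator maps structured inputs to a score between 0 and 1.
Evaluator = Callable[[Mapping[str, Any]], float]


def pass_rate(inputs: Mapping[str, Any]) -> float:
    """Fraction of tests passing after the change."""
    total = inputs["tests_passed"] + inputs["tests_failed"]
    return inputs["tests_passed"] / total if total else 0.0


def build_green(inputs: Mapping[str, Any]) -> float:
    """Binary check: 1.0 if the build succeeded, else 0.0."""
    return 1.0 if inputs["build_ok"] else 0.0


def score_all(evaluators: dict[str, Evaluator],
              inputs: Mapping[str, Any]) -> dict[str, float]:
    """Run every evaluator and keep one score per dimension."""
    scores = {}
    for name, evaluator in evaluators.items():
        value = evaluator(inputs)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} returned {value}, expected 0 to 1")
        scores[name] = value
    return scores


def below_floor(scores: dict[str, float],
                floors: dict[str, float]) -> list[str]:
    """Names of the dimensions whose score is under its floor."""
    return sorted(d for d, floor in floors.items() if scores[d] < floor)


# Illustrative values only.
scores = score_all(
    {"test_pass_rate": pass_rate, "build_green": build_green},
    {"tests_passed": 48, "tests_failed": 2, "build_ok": True},
)
failing = below_floor(scores, {"test_pass_rate": 1.0, "build_green": 1.0})
```

Because scores stay per dimension, a perfect build score cannot mask a failing test score the way a single averaged number could.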

Component 3, the actions

What the harness does with the scores:

  • Auto-accept. Above all per-dimension floors: merge directly or accept the change.
  • Gate for review. Within a tolerance band: surface to a human reviewer with the dimension scores attached.
  • Reject. Below any hard floor: roll back the change and feed the failed case into the dataset.
  • Alert. On regression patterns: notify the team when a dimension's score is drifting across multiple changes.
  • Escalate. On policy or safety violations: route to a designated owner.

The actions close the loop. Scoring without action produces dashboards; the harness's value is in the operational consequences.
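The mapping from scores to actions can be sketched as a small decision function. The precedence chosen here (escalate, then reject, then gate, then accept), the split into hard and soft floors, and the drift rule of three strictly falling scores are assumptions for illustration; the names `Action`, `decide`, and `drifting` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    GATE = "gate"
    REJECT = "reject"
    ESCALATE = "escalate"


def decide(scores: dict[str, float],
           hard_floors: dict[str, float],
           soft_floors: dict[str, float],
           policy_violation: bool = False) -> Action:
    """Turn per-dimension scores into one operational action."""
    if policy_violation:
        return Action.ESCALATE            # route to a designated owner
    if any(scores[d] < f for d, f in hard_floors.items()):
        return Action.REJECT              # below a hard floor
    if any(scores[d] < f for d, f in soft_floors.items()):
        return Action.GATE                # tolerance band: human review
    return Action.ACCEPT                  # above every floor


def drifting(history: list[float], window: int = 3) -> bool:
    """Alert condition: the last `window` scores are strictly falling."""
    recent = history[-window:]
    return len(recent) == window and all(
        earlier > later for earlier, later in zip(recent, recent[1:])
    )


# Illustrative floors and scores.
hard = {"test_pass_rate": 1.0, "build_green": 1.0}
soft = {"style": 0.8}
clean = decide({"test_pass_rate": 1.0, "build_green": 1.0, "style": 0.9}, hard, soft)
rough = decide({"test_pass_rate": 1.0, "build_green": 1.0, "style": 0.7}, hard, soft)
broken = decide({"test_pass_rate": 0.96, "build_green": 1.0, "style": 0.9}, hard, soft)
```

The alert is kept separate from `decide` because it looks across several changes rather than at a single one.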

The operational loop

The agent and the harness compose into a five-step loop:

  1. Agent proposes a change. Reads files, proposes a diff, runs preliminary checks locally.
  2. Harness traces the action. Captures the diff, the model calls, the test results, the build state.
  3. Harness scores the change. Runs the evaluator suite against the inputs.
  4. Harness takes action. Accept, gate, reject, alert, or escalate based on the scores.
  5. Agent receives feedback. The scores flow back into the agent's next attempt, either as in-context information or as a structured prompt update.

The fifth step is what makes the harness more than a gate. An agent that can read the harness's scores improves faster than one that cannot.
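The five steps can be sketched as a retry loop in which the previous attempt's scores are handed back to the agent. This is a toy under stated assumptions: `run_loop`, `toy_agent`, and `toy_score` are hypothetical stand-ins, the agent is a stub rather than a real CLI tool, and step 4 is reduced to accept-or-retry.

```python
from typing import Callable, Optional

Scores = dict[str, float]


def run_loop(propose: Callable[[Optional[Scores]], dict],
             score: Callable[[dict], Scores],
             floors: Scores,
             max_attempts: int = 3) -> tuple[Optional[int], Optional[Scores]]:
    """Return (attempt number that was accepted, its scores), or (None, last scores)."""
    feedback: Optional[Scores] = None
    traces: list[dict] = []
    for attempt in range(1, max_attempts + 1):
        change = propose(feedback)        # 1. agent proposes a change
        traces.append(change)             # 2. harness traces the action
        scores = score(change)            # 3. harness scores the change
        if all(scores[d] >= f for d, f in floors.items()):
            return attempt, scores        # 4. harness accepts
        feedback = scores                 # 5. scores flow back to the agent
    return None, feedback


def toy_agent(feedback: Optional[Scores]) -> dict:
    # Stub: the first attempt breaks a test; with feedback it does not.
    return {"failing_tests": 1 if feedback is None else 0}


def toy_score(change: dict) -> Scores:
    return {"test_pass_rate": 1.0 if change["failing_tests"] == 0 else 0.9}


attempts, final_scores = run_loop(toy_agent, toy_score, {"test_pass_rate": 1.0})
```

In a real setup the feedback would reach the agent as in-context information or a structured prompt update, as described in step 5.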

Dimensions for coding agents

The dimensions that consistently surface signal:

  • Test pass rate, before and after. Deterministic; the most informative single signal.
  • Regression delta. Of the tests that passed before the change, how many still pass? Of the tests that failed, how many now pass?
  • Code-intent alignment. Does the diff implement what the commit message and the user intent claimed?
  • Style and convention adherence. Does the diff match the codebase's existing patterns?
  • Performance and memory regression. For codebases where performance matters, scored against a benchmark suite.
  • Safety. No secrets in the diff, no plaintext credentials, no obvious injection points introduced.
  • Policy adherence. Codebase-specific rules: licensing compliance, dependency policies, security review requirements for sensitive paths.
  • Latency of the agent itself. How long the agent took to produce the change, useful for cost and developer-experience tracking.

All dimensions normalize to a 0-to-1 scale. Per-dimension floors enforce the non-negotiable dimensions (test pass rate, build green); soft dimensions such as style gate to review instead of rejecting.
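The regression delta in the list above comes down to set arithmetic over test names. A minimal sketch, assuming test results are available as sets of names; the function name, the output keys, and the choice to score an empty baseline as 1.0 are assumptions.

```python
def regression_delta(before_passed: set[str],
                     before_failed: set[str],
                     after_passed: set[str]) -> dict:
    """Compare test outcomes before and after a change."""
    retained = before_passed & after_passed   # passed before, still pass
    fixed = before_failed & after_passed      # failed before, pass now
    broken = before_passed - after_passed     # passed before, fail now
    return {
        # Share of previously passing tests that still pass.
        "retained": len(retained) / len(before_passed) if before_passed else 1.0,
        # Share of previously failing tests that now pass.
        "fixed": len(fixed) / len(before_failed) if before_failed else 1.0,
        "newly_failing": sorted(broken),
    }


# Illustrative test names.
delta = regression_delta(
    before_passed={"test_a", "test_b", "test_c", "test_d"},
    before_failed={"test_e"},
    after_passed={"test_a", "test_b", "test_c", "test_e"},
)
```

Here the change fixes `test_e` but breaks `test_d`, so `retained` is 0.75 and a hard floor of 1.0 on that dimension would reject the change even though the overall pass count is unchanged.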

Example

A team adopts CLI coding agents for routine maintenance and refactoring. The team wraps the agents in a harness:

  • Inputs. Every agent action emits a structured trace; the diff and commit message are first-class fields; pre- and post-change test results are recorded.
  • Evaluators. Deterministic: test pass rate, lint clean, build green, type-check green. LLM-as-judge: code-intent alignment, style adherence, refusal correctness. Custom: dependency-policy check, secret scanner, performance regression on a 12-test benchmark suite.
  • Actions. Above all floors: auto-merge. In the tolerance band: surface to review with scores attached. Below test-pass floor or build floor: reject and roll back. Style drift across three consecutive PRs: alert the team lead.

Over a typical week the team's harness sees 14 PRs from CLI agents. The breakdown:

  • 11 auto-merged. Average code-intent alignment 0.94; average style adherence 0.89.
  • 2 surfaced for review. Reviewer ratifies one as-is, edits the other before merging.
  • 1 rejected for a regression on a previously-passing test. The case enters the dataset; the next agent pass on the same task succeeds because the agent reads the failure score from the previous attempt.

The team's review time drops 70 percent because the harness reads what no human had time to read. The team's regression rate per release drops to a fraction of what it was, because every accepted change cleared the gate.

Limitations

Caveats worth flagging up front:

  • The dataset is the work. A curated regression suite for the codebase is the asset that makes the harness meaningful; without it, the harness scores against nothing.
  • LLM-as-judge calibration is per-codebase. Style and intent judges calibrated on one codebase do not transfer cleanly to another. Treat calibration as a per-codebase operational task.
  • Latency and cost compound. Each agent action adds evaluator calls. Pin smaller models for cheap evaluators; reserve frontier judges for the dimensions that matter.
  • Agents differ in tracing fidelity. Some CLI agents emit structured traces natively; others need wrapper instrumentation. Plan for the integration cost per agent.
  • The harness can over-fit. Tight floors on every dimension block legitimate trade-offs (a performance gain that fails a style check is not always wrong). Treat the harness as a guardrail, not a ceiling.

FAQ

Does the harness work for any CLI coding agent or just one? For any. The contract is the trace schema and the evaluator interface, not the agent vendor. Different agents emit traces with different fidelity; a wrapper layer normalizes them. The dataset, evaluators, and actions are the same.

How big does the regression dataset need to be? Coverage matters more than size. A set of 100 representative cases covering the codebase's hot paths and known failure modes is more useful than 1,000 near-duplicates. Grow the dataset by failure cluster, not by recency.
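Growing by failure cluster can be sketched as a per-cluster cap: a new failed case is added only if its cluster is not already well represented. The cluster labels, the cap of 3, and the function name `add_case` are assumptions for illustration; how cases are assigned to clusters is left open.

```python
def add_case(dataset: dict[str, list[str]],
             cluster: str,
             case_id: str,
             per_cluster_cap: int = 3) -> bool:
    """Add a failed case unless its cluster is full or already holds it."""
    bucket = dataset.setdefault(cluster, [])
    if case_id in bucket or len(bucket) >= per_cluster_cap:
        return False          # near-duplicate pressure: skip it
    bucket.append(case_id)
    return True


# Illustrative cluster labels and case ids.
dataset: dict[str, list[str]] = {}
results = [
    add_case(dataset, "null-handling", "case-001"),
    add_case(dataset, "null-handling", "case-002"),
    add_case(dataset, "null-handling", "case-003"),
    add_case(dataset, "null-handling", "case-004"),   # cluster already full
    add_case(dataset, "import-cycle", "case-005"),    # new cluster, accepted
]
```

The cap keeps a burst of similar recent failures from crowding out coverage of other failure modes.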

Should the harness block on style scores? No. Style is a tolerance-band dimension; the harness surfaces for review rather than rejects. Hard blocks belong on test pass rate, build green, and safety floors.

How does this compare to a human code review? The harness is faster and cheaper but blind to the things a human reviewer catches by reading intent. The two are complements: the harness handles deterministic and calibrated dimensions; the human handles intent and judgment.

Can the agent itself read the harness's output? Yes, and it should. An agent that consumes its own per-dimension scores from previous attempts iterates more efficiently. The feedback loop is the harness's most underused property.
