Updated: 2026-01-02 By: Ari Heljakka
Short answer
LLM evaluation scores a single input-output pair. Agent evaluation scores a trajectory: a sequence of model calls, tool calls, intermediate plans, and side effects that produce (or fail to produce) a goal-level outcome. Tools built for turn-level grading miss the structural facts that decide whether an agent worked: did it pick the right tool, pass the right arguments, interpret the tool result correctly, and converge on the user's actual intent across the session? Treating an agent like a chat completion produces evaluators that look green in testing and fail in production. The fix is structural, not cosmetic: evaluators with trace-level scope, tool-call grading, trajectory metrics, and goal-level outcomes that compose into a per-dimension scorecard.
Key facts
- Definition: Agent evaluation grades a multi-step trajectory against goal-level outcomes; LLM evaluation grades a single turn against per-response criteria. The first is structurally a tree of spans; the second is a leaf.
- When to use: Any system that calls tools, persists state across turns, or executes a plan before producing a final response. Single-turn classification, extraction, or completion remains LLM-evaluation territory.
- Limitations: Trace-level evaluation requires high-fidelity instrumentation, ground-truth labels at the goal level, and judge agreement tracked against the calibration dataset. The cost is real; the alternative is a green test suite that drifts away from production reality.
- Example: A 20-step agent with 95 percent per-step reliability succeeds end-to-end only about 36 percent of the time; per-turn scores can all be green while the goal-level outcome is a failure.
Key takeaways
- Per-step quality and goal-level success are different metrics. High per-step scores do not imply high end-to-end success.
- A 20-step trajectory at 95 percent per-step reliability succeeds end-to-end with probability 0.95^20 ≈ 0.36. Compounding is the dominant failure mode.
- Tool-call correctness is its own dimension. Right tool, right arguments, correct interpretation of the response are three independent failure surfaces.
- Trajectory metrics (steps taken, retries, re-plans, plan-vs-execution divergence) carry quality signal that turn-level grading discards.
- Goal-level outcomes (task completion, faithfulness across turns, user-intent fulfillment) are the only metrics that match the user's experience and must be first-class.
Definition
LLM evaluation grades a single prompt-and-response pair against per-turn criteria: helpfulness, faithfulness, safety, tone. Each judge runs over one input and one output; the unit of work is a turn.
Agent evaluation grades a trajectory: the sequence of model calls, tool calls, retries, re-plans, and intermediate plans that lead to a final user-visible outcome. The unit of work is a session, and the criteria are layered: per-step correctness, per-decision quality (tool choice, tool arguments, response interpretation), and goal-level outcome (did the user's actual intent get fulfilled).
The two categories share vocabulary ("evaluator," "score," "dimension"), but the data model is different. A turn-level evaluator sees one row; a trace-level evaluator sees a tree.
When this matters
Agent evaluation becomes necessary the moment any of these properties hold:
- The agent calls tools whose results shape subsequent decisions.
- The agent persists state across turns within a session.
- The agent emits a plan that is executed across multiple steps.
- The user's success criterion lives at the goal level, not the turn level: a refund processed, a flight rebooked, or a ticket resolved, rather than "the last reply was polite."
If none of those hold, single-turn LLM evaluation is sufficient. The moment any of them holds, turn-level evaluation systematically under-measures the failure surface.
How it works: the structural differences and the evaluator stack
Turn-level evaluation and trajectory-level evaluation differ in five orthogonal ways. Each shapes the evaluator's data model and lifecycle.
Difference 1, scope (turn vs trajectory)
A turn-level evaluator scores a single input-output pair. A trajectory-level evaluator scores the full trace of a session. The scope difference is not a graph view; it is the data model the evaluator operates on. A trajectory evaluator must consume the trace, attribute success and failure across steps, and emit per-dimension scores at multiple granularities (per step, per decision, per goal).
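To make the scope difference concrete, here is a minimal Python sketch of the two data models. The names `Turn`, `Span`, and `walk` are hypothetical, chosen for illustration only; they do not come from any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Turn:
    """What a turn-level evaluator sees: one row."""
    input: str
    output: str


@dataclass
class Span:
    """What a trajectory-level evaluator sees: a node in a tree of spans."""
    kind: str   # e.g. "plan", "model_call", "tool_call" (illustrative values)
    name: str
    children: list["Span"] = field(default_factory=list)


def walk(span: Span, depth: int = 0) -> Iterator[tuple[int, Span]]:
    """Yield every span in the trace depth-first, paired with its depth."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# A hypothetical three-span trace for a refund session.
trace = Span("plan", "refund_session", [
    Span("tool_call", "lookup_order"),
    Span("model_call", "draft_reply"),
])

kinds = [span.kind for _, span in walk(trace)]
```

A turn-level judge would only ever receive a `Turn`; a trajectory evaluator iterates the whole tree and can attribute a failure to a specific span.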
Difference 2, tool-call correctness
For agents that call tools, "correctness" decomposes into three orthogonal questions. Did the agent pick the right tool from the catalog? Were the arguments well-formed and semantically right? And did the agent correctly read the tool's response, including failure cases (empty results, malformed payloads)? Each of those is a distinct failure surface, scored independently and composed into a tool-call quality score. A 0 to 1 normalization per dimension lets the team weight them, alert on drift in any one, and gate per-dimension floors in CI/CD.
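One way to keep the three questions separate is sketched below. The `ToolCall` and `Expected` records, the exact-match scoring rules, and the weights are all assumptions made for illustration; a real grader would likely check arguments against the tool's schema and use a judge for semantic correctness.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict
    interpreted_as: str   # how the agent read the result (hypothetical label)


@dataclass
class Expected:
    tool: str
    args: dict
    interpretation: str


def grade_tool_call(call: ToolCall, want: Expected) -> dict[str, float]:
    """Score selection, arguments, and interpretation independently in [0, 1]."""
    selection = 1.0 if call.tool == want.tool else 0.0
    # Fraction of expected arguments present with the expected value.
    matched = sum(1 for k, v in want.args.items() if call.args.get(k) == v)
    arguments = matched / len(want.args) if want.args else 1.0
    interpretation = 1.0 if call.interpreted_as == want.interpretation else 0.0
    return {"selection": selection, "arguments": arguments,
            "interpretation": interpretation}


def compose(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the per-dimension scores."""
    total = sum(weights.values())
    return sum(scores[d] * w for d, w in weights.items()) / total


call = ToolCall("search_flights", {"origin": "HEL", "dest": "SFO"}, "tool_failure")
want = Expected("search_flights", {"origin": "HEL", "dest": "JFK"}, "no_results")
scores = grade_tool_call(call, want)
quality = compose(scores, {"selection": 1, "arguments": 1, "interpretation": 2})
```

In this made-up case the agent picked the right tool, got one of two arguments right, and misread the result, so the three dimensions diverge even though they describe a single call.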
Difference 3, trajectory metrics
The shape of the trajectory carries signal that final-answer scoring discards. Step count relative to a baseline tells the team about wasted budget, since a 20-step path where 8 would have sufficed is a usable signal on its own. Retry rate and re-plan rate, when they run high, point at planner-execution divergence or brittle tool integration. Plan-vs-execution divergence asks whether the executor follows the planner's plan or improvises; both are valid behaviours, and the divergence between them is itself a metric. Branch entropy, taken across many runs of the same scenario, measures how varied the paths are and is the place to look for non-determinism.
These are not nice-to-have charts; they are independent evaluation dimensions that detect failures the final-answer score will miss.
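As a rough sketch, three of these shape metrics can be computed from nothing more than the ordered list of step names in each run. The function names and the definition of a retry (a step that immediately repeats the previous one) are simplifying assumptions, not a standard.

```python
import math
from collections import Counter


def step_ratio(steps: list[str], baseline_steps: int) -> float:
    """Steps taken relative to a baseline; above 1.0 suggests wasted budget."""
    return len(steps) / baseline_steps


def retry_rate(steps: list[str]) -> float:
    """Fraction of transitions where a step immediately repeats the previous one."""
    if len(steps) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(steps, steps[1:]) if prev == cur)
    return repeats / (len(steps) - 1)


def branch_entropy(runs: list[list[str]]) -> float:
    """Shannon entropy in bits over the distinct paths seen across runs."""
    counts = Counter(tuple(run) for run in runs)
    n = len(runs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


run = ["plan", "search", "search", "book"]
runs = [["plan", "search"], ["plan", "search"],
        ["plan", "lookup"], ["plan", "lookup"]]

ratio = step_ratio(run, baseline_steps=2)
retries = retry_rate(run)
entropy = branch_entropy(runs)
```

An entropy of zero means every run of the scenario took the same path; higher values mean the agent is choosing different routes for the same input.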
Difference 4, goal-level outcomes
Per-turn quality can be uniformly high while the user's actual intent is unmet. Goal-level outcomes are the metric that maps to the user's experience:
- Task completion: did the agent end in the goal state (the refund posted, the ticket closed)?
- Faithfulness across turns: are claims made early still consistent with what the agent says later?
- Intent fulfillment: did the agent address what the user asked, not what the agent inferred?
Goal-level metrics require ground-truth labels on session outcomes, which is more expensive than per-turn labels but more aligned with what the team is actually paid to ship.
Difference 5, compounding reliability
The arithmetic of multi-step reliability is unforgiving. If step failures are independent, a 20-step pipeline with 95 percent per-step reliability succeeds end-to-end with probability 0.95^20 ≈ 0.36. Per-turn evaluation can show 95 percent across the board while the end-to-end product fails the user nearly two-thirds of the time.
This is not a fault of the per-turn evaluators; it is a category error to use them as the only signal. Goal-level evaluation is the only metric that exposes the compounding failure.
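The arithmetic is easy to check, and it is worth running in reverse to see what per-step reliability a given end-to-end target demands. The sketch below assumes step failures are independent, which is the simplification behind the 0.95^20 figure; the function names are illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach an end-to-end target."""
    return target ** (1 / steps)


p20 = end_to_end_success(0.95, 20)    # roughly 0.36
need = required_per_step(0.90, 20)    # roughly 0.995
```

Under this model, a 20-step agent that should succeed 90 percent of the time needs each step to be right about 99.5 percent of the time.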
How the evaluation system changes
The evaluator stack for agents has four properties that turn-level stacks do not need:
- Trace ingestion as a first-class input. Evaluators receive structured traces with spans, tool calls, plans, and side effects, not just a single input-output pair.
- Multi-dimensional scoring composed into scorecards. Each dimension (tool selection, argument correctness, response interpretation, plan-execution divergence, goal completion) is scored independently and normalized to 0 to 1, then composed with weights into a per-session scorecard. Composition prevents double-counting and lets the team tune trade-offs.
- Versioned calibration datasets at the trajectory level. Ground-truth labels live on session outcomes, not just on final turns. The dataset is versioned, judge agreement against it is tracked as its own metric, and dataset refresh is a recurring operational task.
- Continuous scoring in CI/CD and in production. The same evaluators that gate the PR run on sampled production traces, with score drift treated as a first-class signal that fires alerts and feeds candidates back into the calibration dataset.
The combination is what turns a green test suite into a quality system that survives contact with production.
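The drift signal in the last property can be sketched as a comparison between per-dimension scores from the gated test run and the same dimensions scored on sampled production traces. The function name, the `max_drop` threshold of 0.05, and the sample numbers are all hypothetical; a production system would likely add a statistical test and a minimum sample size.

```python
from statistics import mean


def drifted_dimensions(
    baseline: dict[str, list[float]],
    production: dict[str, list[float]],
    max_drop: float = 0.05,
) -> dict[str, float]:
    """Return dimensions whose mean score dropped by more than max_drop."""
    drops = {}
    for dim, base_scores in baseline.items():
        drop = mean(base_scores) - mean(production[dim])
        if drop > max_drop:
            drops[dim] = drop
    return drops


# Made-up per-session scores for two dimensions.
baseline = {"goal_completion": [0.9, 0.9, 0.8],
            "tool_selection": [1.0, 1.0, 0.9]}
production = {"goal_completion": [0.7, 0.8, 0.6],
              "tool_selection": [1.0, 0.9, 1.0]}

alerts = drifted_dimensions(baseline, production)
```

Here only `goal_completion` would alert. Sessions behind an alerting dimension are natural candidates to label and add to the calibration dataset.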
Example: an agent that scores 95 percent per turn and fails the user
A booking agent handles a flight change in 20 turns. Each turn (parsing intent, calling pricing, calling availability, presenting options, confirming, calling the booking API, generating a confirmation message) scores 95 percent on per-turn quality. The team is satisfied.
Goal-level reality:
- 36 percent of sessions complete the booking successfully (0.95^20).
- The remaining 64 percent fail somewhere: a tool returned a stale availability response, an argument was off by one, a plan re-ran twice and used the wrong fare class on the second attempt.
- Per-turn judges flagged none of it; each turn looked individually fine.
After switching to trajectory-level evaluation:
- A goal-level evaluator scores task completion against a ground-truth label.
- A tool-call evaluator scores selection, arguments, and response interpretation separately.
- A trajectory-shape evaluator flags sessions with three or more re-plans.
- The cluster view groups the failures: most are response-interpretation errors on the availability tool when it returns an empty array under a particular query shape.
The fix is a planner-level assertion (treat empty arrays as "no availability," not as "tool failure"). The per-turn scores barely move; the goal-level completion rate climbs from 36 percent to the high 80s. The change would have been invisible to a per-turn evaluation stack.
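A minimal sketch of that assertion might look like the following. The response shape (a dict with a `flights` list) and the three labels are assumptions for illustration; the real availability tool's payload is not specified here.

```python
from typing import Optional


def interpret_availability(response: Optional[dict]) -> str:
    """Classify a hypothetical availability-tool response."""
    if response is None or "flights" not in response:
        return "tool_failure"        # missing or malformed payload
    flights = response["flights"]
    if not isinstance(flights, list):
        return "tool_failure"        # payload present but wrong shape
    if len(flights) == 0:
        return "no_availability"     # a valid answer, not an error
    return "available"


empty = interpret_availability({"flights": []})
broken = interpret_availability({"error": "timeout"})
found = interpret_availability({"flights": [{"id": "AY1"}]})
```

The point is that an empty list and a failed call take different branches, so the planner stops retrying or re-planning when the tool has in fact answered.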
Limitations
- Trajectory labels are expensive. Goal-level ground truth requires domain experts to judge full sessions, which is the rate-limiter on dataset growth. Prioritize by cluster impact, not by recency.
- Tool-call grading needs schemas. Argument correctness depends on knowing the schema; semantic correctness depends on the dataset. Both must be maintained.
- Sampling becomes a quality lever. Trajectory evaluation costs more per session than per turn; sampling strategy decides whether the bill or the team controls coverage.
- Judges drift. Trajectory evaluators are themselves LLM-as-judges in many cases; judge agreement with ground truth must be tracked continuously, not certified once.
- Compounding is hard to communicate. Per-turn dashboards look green; goal-level dashboards look red. The translation between them is an organizational problem as much as a technical one.
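For the judge-drift point, one common way to track agreement is raw agreement alongside Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes categorical labels such as "pass" and "fail" on the same sessions; the function names and sample labels are illustrative.

```python
from collections import Counter


def agreement_rate(judge: list[str], truth: list[str]) -> float:
    """Fraction of sessions where the judge matches the ground-truth label."""
    return sum(j == t for j, t in zip(judge, truth)) / len(truth)


def cohens_kappa(judge: list[str], truth: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(truth)
    observed = agreement_rate(judge, truth)
    judge_counts, truth_counts = Counter(judge), Counter(truth)
    labels = set(judge_counts) | set(truth_counts)
    expected = sum(judge_counts[l] * truth_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0   # both sides used a single identical label
    return (observed - expected) / (1 - expected)


truth = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "pass", "fail"]

raw = agreement_rate(judge, truth)
kappa = cohens_kappa(judge, truth)
```

Recomputing both numbers on every dataset version and judge version turns "judges drift" from a worry into a tracked metric.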
Evidence and sources
- Anthropic, Test and evaluate, https://docs.anthropic.com/en/docs/test-and-evaluate/, for the rubric-and-judge model behind versioned trajectory evaluators.
- OpenAI, Function calling and tool use, https://platform.openai.com/docs/guides/function-calling, for the tool catalog and structured-response shapes that underlie tool-call grading.
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the trace-and-span model that trajectory evaluation consumes.
FAQ
Can I keep my per-turn evaluators? Yes. Per-turn evaluators are still useful for monitoring response quality on individual surfaces. The change is to add trajectory and goal-level evaluators alongside, not to delete the per-turn ones.
How do I get goal-level labels without burning out domain experts? Prioritize labeling by cluster impact. Label one representative session per cluster, propagate the label to similar sessions using the judge's agreement check as a guard, and reserve human time for the highest-impact clusters.
Is task-completion the only goal-level metric I need? No. Task completion, faithfulness across turns, and intent fulfillment are three independent dimensions. Treat them as orthogonal and compose; do not collapse to a single number.
What is the minimum viable agent evaluation stack? Trace ingestion, three trajectory evaluators (tool-call quality, plan-execution divergence, goal completion), and a versioned dataset of fifty representative sessions. Grow from there.
Where do guardrails fit? Guardrails block bad outputs in real time. Evaluation tells you which guardrails are needed, where they fire, and whether they work. The two are complementary, not interchangeable.