How do you evaluate multi-turn agent conversations?

How do you evaluate multi-turn agent conversations?

Updated: 2026-02-10 By: Ari Heljakka

Short answer

A multi-turn agent conversation is not a sequence of independent request-response pairs; it is a stateful trajectory in which the agent has to maintain user intent, recall constraints introduced earlier, hold a consistent persona, and pass clean state through tool calls. Evaluating it well means treating the conversation as the unit of analysis. The dimensions that matter are state tracking (does the agent remember what was established), intent drift (does the goal stay coherent), persona consistency (does the agent stay in character and policy), tool argument correctness (does state pass cleanly into calls), and recovery (does the agent update when the user clarifies). Each dimension is a separate evaluator, calibrated against domain-labeled conversation samples, and run continuously against production sessions.

Key facts

  • Definition: Multi-turn conversation evaluation is the practice of scoring a full agent conversation along orthogonal dimensions (state tracking, intent drift, persona, tool correctness, recovery), with the conversation as the primary unit of evaluation.
  • When to use: Whenever the agent's behavior depends on accumulated context, whenever a final-turn answer can hide an upstream misparse, and whenever single-turn metrics give a misleadingly clean score.
  • Limitations: It requires step-level trace capture, conversation-aware evaluators, and (for coverage of unseen paths) simulation. Final-turn metrics are cheaper and inadequate.
  • Example: A research agent that misparses "flag" as "exclude" at turn 2 produces a coherent summary at turn 8 that omits the studies it was supposed to highlight. A turn-8-only evaluator passes; a trajectory evaluator catches it.

Key takeaways

  • The conversation is the unit. Evaluating turns in isolation hides the failure modes that only show up across the trajectory.
  • Decompose into orthogonal dimensions. State tracking, intent drift, persona, tool correctness, and recovery are independent; combining them as separate evaluators preserves diagnostic power.
  • The failure modes that matter are silent. Misparses at turn 2 cascade into coherent-looking outputs at turn 8.
  • Simulation is the only way to cover paths production has not yet shown. Adversarial personas and ambiguous clarifications belong in the calibration set, not just in ad-hoc red-teaming.
  • Conversation evaluators belong in CI and in a sampled production tap. Run them on every change and on every Nth real session.

Definition

A multi-turn agent conversation is a sequence of turns in which the agent's behavior at turn N depends on state established at turns 1 through N minus 1. The state includes user goal, established constraints, tool results carried forward, persona, and policy. Evaluating such a conversation means producing a score per dimension across the entire trajectory, not per turn in isolation.

The orthogonal dimensions that survive across most agent use cases are:

  • State tracking: does the agent remember what the user has established earlier in the conversation.
  • Intent drift: does the agent's understanding of the user's goal stay coherent, or does it slowly mutate.
  • Persona and policy consistency: does the agent maintain the role and the policy it was instructed to follow.
  • Tool argument correctness: when the agent calls a tool with data forwarded from an earlier step, is the data still correct.
  • Recovery: when the user clarifies or corrects, does the agent update its internal state accordingly.

Each is a separate evaluator, calibrated against domain-labeled samples. Combining them into a scorecard preserves the ability to ask "what specifically degraded" when an overall score drops.

When this matters

Multi-turn evaluation is critical when:

  • The agent maintains state across turns (a support agent, a research assistant, a planner).
  • Tool calls carry state forward from prior steps (a billing query that uses customer-lookup results from earlier).
  • The agent must hold a persona, a policy, or a regulatory posture across the whole conversation.
  • The product has a long-tail of user behaviors (ambiguous follow-ups, mid-conversation constraint changes, contradictions) that synthetic single-turn benchmarks do not cover.

If the product is a single-turn classifier or a one-shot QA system, multi-turn evaluation is overkill; single-turn metrics suffice.

How it works

Capture step-level traces

The starting point is structured traces that record every turn, every tool call with its arguments and result, and every state transition. Conversation-level evaluation requires conversation-level traces; a log of inputs and final outputs is not enough. OpenTelemetry GenAI semantic conventions provide a portable schema for the spans.

Decompose the conversation into independent dimensions

Treat each dimension above as a separate evaluator. A managed evaluator produces a normalized score (typically 0 to 1) for one dimension. Independence matters: when state tracking degrades but persona is fine, the team must be able to see exactly that. Combining everything into a single "quality" score loses the diagnostic.

Calibrate each evaluator against a ground-truth conversation set

For each dimension, build a ground-truth dataset of labeled conversations. Domain experts label representative samples (typically from the highest-volume failure clusters) along the dimension being calibrated. The evaluator's prompt or rule set is tuned until its agreement with human labels reaches a defensible threshold on the calibration set. Below that threshold, the evaluator is not yet a reliable signal.

Use simulation to extend coverage

Production data shows what users have done; simulation covers what they might do. A simulator LLM acts as a user with a defined persona, goal, and constraints, and exercises the agent through turns until the conversation reaches a natural conclusion. Run many simulations per scenario with different personas (cooperative, ambiguous, contradictory, adversarial) and score each conversation along the same dimensions. Simulation outputs join the conversation evaluation suite alongside production samples.

Gate deployments and sample production continuously

Every prompt change, model swap, and architecture update runs the conversation evaluator suite against a stable set of scenarios. Production traffic is sampled, scored on the same suite, and graphed. A drop on any single dimension is an operational signal; a drop across several is an incident.

The lineage on every score includes evaluator version, judge model, dataset version, and the rubric criterion. Without that lineage, a regression cannot be attributed to a model change versus an evaluator change.

Example

A clinical research agent is asked: "Summarize the recent trial findings and flag studies with sample sizes under 100." The agent should identify small-sample studies and highlight them for the user's attention.

  • Turn 2: The agent misparses "flag" as "exclude" and constructs an internal note that under-100-sample studies should be removed from the summary.
  • Turn 4: Tool calls retrieve study metadata. The agent's state, anchored to the misparse, filters out small-sample studies from the working set passed forward.
  • Turn 8: The final summary is well-written, coherent, and omits the small-sample studies the user wanted highlighted.

A final-turn evaluator looking at the output reads it as competent. The dimensional evaluators tell a different story:

  • State tracking: Pass. The agent remembered the working set it had built.
  • Intent drift: Fail. The user's instruction was to highlight, not to exclude; the goal mutated at turn 2.
  • Persona and policy: Pass.
  • Tool argument correctness: Pass at the call level; the calls used the (wrong) working set correctly.
  • Recovery: Not tested in this trace because the user did not clarify.

The dimensional breakdown points the engineering team at turn 2. A simulation suite with an adversarial follow-up ("you missed the small studies, can you re-check") exercises the recovery dimension explicitly.

Limitations

  • Trace capture is a prerequisite. Without step-level instrumentation, the evaluators are guessing about what happened between turns.
  • Calibration is a real cost. Each dimension needs a domain-labeled set; building it well takes domain expert time, not annotator volume.
  • LLM judge reliability degrades on long contexts. A judge scoring a twenty-turn conversation is less reliable than the same judge scoring a five-turn excerpt. Sliding-window approaches help; calibration matters even more.
  • Simulation is not a substitute for production data. A simulator LLM is an approximation of a user; over-fitting to simulator behavior is a real risk. Mix simulated and real samples in the suite.
  • Cost compounds with turns. A twenty-turn conversation evaluated across five dimensions costs an order of magnitude more than a single-turn check. Sample, do not full-score every production session.
  • Some dimensions are domain specific. A clinical agent has dimensions a customer support agent does not. Reuse the orthogonal-decomposition pattern, not a fixed dimension list.

Evidence and sources

FAQ

How many dimensions should I track? Start with the orthogonal set in this post (state tracking, intent drift, persona, tool argument correctness, recovery) and add domain-specific dimensions only when a real failure pattern motivates it. Composability matters; ten overlapping dimensions are worse than five independent ones.

Should I score every conversation or sample? Sample, for cost. Scoring every conversation across multiple LLM-judge evaluators is rarely affordable at scale. A representative sampling rate plus full scoring on every deployment candidate is the usual shape.

What about conversations that go on for fifty turns? Decompose. A long conversation is evaluated in windows of three to five turns for turn-level dimensions and as a whole for conversation-level dimensions (completeness, retention). The window size is a tuning parameter, not a constant.

How do I detect intent drift specifically? A judge prompt that compares the current agent state to the original user goal, paired with a recovery probe: when the user re-states or clarifies, does the agent update. Drift is the agent's deviation from the goal; recovery is the agent's correction when the goal is restated.

What is the relationship between conversation evaluation and observability? Observability captures the trace; conversation evaluation scores it. Both are needed. The observability platform answers "what happened"; the evaluator suite answers "was it good along each dimension."

Are these evaluators model-agnostic? Yes. The same evaluator runs against any underlying agent model. That is the point of separating the evaluator from the implementation: the evaluation framework remains constant as models and prompts change.

Related reading