Updated: 2026-03-17 By: Ari Heljakka
Short answer
Observing an agentic system means recording every run, trace, and thread with enough fidelity that a third party could reconstruct what happened. Evaluating that system means scoring those same recordings on orthogonal dimensions, against a versioned ground-truth dataset, with calibrated judges. The two practices are not interchangeable, and neither is sufficient on its own. Observability without evaluation produces dashboards no one trusts; evaluation without observability produces scores no one can trace back to a cause. The combined practice ties each score to the exact span that earned it, then feeds production failures back into the dataset that gates the next deployment.
Key facts
- Definition: Observation plus evaluation is the joint practice of capturing structured agent execution data (runs, traces, threads) and continuously scoring it against versioned objectives.
- When to use: Any agent that is multi-turn, tool-using, or operating against a live user base where failures cost money or trust.
- Limitations: Requires up-front investment in tracing schema, dimensions, and ground truth; ongoing operational cost for sampling, judges, and label refresh.
- Example: A support agent's per-turn observability looks healthy while goal-level completion drops 8 points week over week; only paired evaluation against goal-level labels surfaces the regression.
Key takeaways
- Observability captures execution; evaluation scores it. Both are needed, in the same trace schema.
- The three primitives of observation are runs, traces, and threads. The three levels of evaluation are single-step, full-turn, and multi-turn.
- Evaluation dimensions are orthogonal, and each is normalized to the range 0 to 1. Composition happens later; do not collapse them into a single number.
- The same evaluators run in CI/CD and in production. Consistency between the two is what makes the gate trustworthy.
- The feedback loop from production traces back into the ground-truth dataset is what keeps evaluation grounded in production reality over time.
Definition
Observation is the structured capture of an agent's execution: every model call, every tool call (with structured request and structured response), every retry, every plan, every side effect, organized into spans that share a trace identifier.
Evaluation is the scoring of those captured spans on a set of objectives (faithfulness, tool-call quality, plan coherence, policy adherence, goal completion) against a calibrated ground-truth dataset, with judges that are themselves versioned artifacts.
The two practices share a trace schema. Evaluators consume what observability produces. The dataset that calibrates the judges is sampled from the same traces that flow through observability. The metrics returned by the judges land in the same dashboards as the latency and error counters from the observability layer.
When this matters
The combined practice earns its keep when at least two of these hold:
- Per-turn behavior looks acceptable while goal-level completion or business outcomes degrade.
- Multiple prompt or model revisions ship in parallel with overlapping ownership.
- Failures cluster in patterns that no one notices until a user reports them.
- The agent operates over multiple turns or invokes tools whose responses shape later decisions.
- Regulatory or audit pressure makes the trace itself a deliverable, not just an internal debugging aid.
Single-turn extraction pipelines can get away with thinner observability and simpler evaluation. The discipline below is sized for the multi-turn, tool-using, goal-driven case.
How it works
The combined practice rests on five linked components. Each is described as a property to aim for, not a tool to install.
Component 1, the three observation primitives
Three primitives organize what gets captured:
- Runs. A single execution step: one model call with its inputs, outputs, parameters, and timing. The atomic unit of observation.
- Traces. A complete sequence of runs that make up one full agent execution: planner, retriever, tool calls, model calls, response.
- Threads. Grouped traces that represent a multi-turn conversation, including human interventions and out-of-band corrections.
A trace schema that distinguishes these three levels makes every later evaluation step possible. A schema that collapses them makes evaluation guess at structure that was never recorded.
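A minimal sketch of a schema that keeps the three levels distinct might look like the following. The class names mirror the primitives above, but the field names (`span_id`, `kind`, `duration_ms`) and the `all_runs` helper are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Run:
    """One execution step: a single model or tool call."""
    span_id: str
    kind: str  # hypothetical labels such as "model" or "tool"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    duration_ms: float


@dataclass
class Trace:
    """All runs that make up one full agent execution."""
    trace_id: str
    runs: list[Run] = field(default_factory=list)


@dataclass
class Thread:
    """Traces grouped into one multi-turn conversation."""
    thread_id: str
    traces: list[Trace] = field(default_factory=list)

    def all_runs(self) -> list[Run]:
        # Flatten in order, so single-step evaluators can walk every run.
        return [run for trace in self.traces for run in trace.runs]


thread = Thread(
    thread_id="th-1",
    traces=[
        Trace("tr-1", [Run("sp-1", "model", {"prompt": "hi"}, {"text": "hello"}, 120.0)]),
        Trace("tr-2", [
            Run("sp-2", "tool", {"name": "lookup_order"}, {"status": "shipped"}, 45.0),
            Run("sp-3", "model", {"prompt": "summarize"}, {"text": "Your order shipped."}, 210.0),
        ]),
    ],
)
print(len(thread.traces), len(thread.all_runs()))  # 2 3
```

Because the levels are separate types, an evaluator can declare which one it consumes instead of inferring turn boundaries from a flat log.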
Component 2, the three evaluation levels
Evaluation runs at three levels that mirror the observation primitives:
- Single-step. Score one model call or one tool call. Faithfulness of a single response; correctness of a single tool argument.
- Full-turn. Score one full trace end to end. Did the plan match the goal? Did the executor follow the plan? Did the final answer satisfy the user's turn?
- Multi-turn. Score a thread. Did the agent eventually reach the user's actual goal? Did it accumulate state correctly across turns?
Each level produces orthogonal signal. A single-step score that looks healthy does not imply a full-turn score that looks healthy; a full-turn score that looks healthy does not imply a multi-turn score that looks healthy.
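The toy sketch below illustrates how the three levels can disagree on the same thread. The three scoring functions are stand-ins for real judges, and the dictionary keys (`tool_args_valid`, `answered_turn`, `goal_reached`) are hypothetical labels invented for this example.

```python
from statistics import mean


def score_single_step(run: dict) -> float:
    # Stand-in for a single-step judge: was this one tool argument valid?
    return 1.0 if run["tool_args_valid"] else 0.0


def score_full_turn(trace: dict) -> float:
    # Stand-in for a full-turn judge: did the final answer satisfy the turn?
    return 1.0 if trace["answered_turn"] else 0.0


def score_multi_turn(thread: dict) -> float:
    # Stand-in for a multi-turn judge: did the agent reach the user's goal?
    return 1.0 if thread["goal_reached"] else 0.0


def evaluate_thread(thread: dict) -> dict[str, float]:
    runs = [run for trace in thread["traces"] for run in trace["runs"]]
    return {
        "single_step": mean(score_single_step(r) for r in runs),
        "full_turn": mean(score_full_turn(t) for t in thread["traces"]),
        "multi_turn": score_multi_turn(thread),
    }


thread = {
    "goal_reached": False,
    "traces": [
        {"answered_turn": True, "runs": [{"tool_args_valid": True}]},
        {"answered_turn": False,
         "runs": [{"tool_args_valid": True}, {"tool_args_valid": True}]},
    ],
}
report = evaluate_thread(thread)
print(report)  # {'single_step': 1.0, 'full_turn': 0.5, 'multi_turn': 0.0}
```

Every individual step passes, yet only half the turns succeed and the goal is never reached, which is the pattern a single-level dashboard would miss.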
Component 3, the bridge between observation and evaluation
The trace schema is the bridge. Every evaluator consumes structured spans, not free text. Every score the evaluator produces attaches back to the span that earned it, so the dashboard view and the evaluation view share an identifier.
Two design rules keep the bridge structurally sound:
- Evaluators never re-parse free text when they could consume structured spans. A need to re-parse means the trace schema failed to capture a field it should have.
- Scores include the span identifier they evaluated. A score with no anchor cannot be debugged, so it is operationally worthless.
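Both rules can be sketched in a few lines. The `Score` record, the `score_tool_span` evaluator, and the span fields used here are assumptions made for illustration; a real schema would define its own names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    span_id: str
    dimension: str
    value: float
    evaluator_version: str

    def __post_init__(self) -> None:
        # Rule 2: a score with no anchor is rejected at construction time.
        if not self.span_id:
            raise ValueError("a score must name the span it evaluated")
        if not 0.0 <= self.value <= 1.0:
            raise ValueError("scores are normalized to the range 0 to 1")


def score_tool_span(span: dict) -> Score:
    # Rule 1: read the structured response field; no text parsing.
    ok = span["response"]["status"] == "ok"
    return Score(span["span_id"], "tool_call_quality", 1.0 if ok else 0.0, "v1")


def index_by_span(scores: list[Score]) -> dict[str, list[Score]]:
    index: dict[str, list[Score]] = {}
    for score in scores:
        index.setdefault(score.span_id, []).append(score)
    return index


spans = [
    {"span_id": "sp-2", "type": "tool_call", "response": {"status": "ok"}},
    {"span_id": "sp-5", "type": "tool_call", "response": {"status": "timeout"}},
]
index = index_by_span([score_tool_span(s) for s in spans])
print(index["sp-5"][0].value)  # 0.0
```

With the index keyed by span identifier, a low score on a dashboard resolves directly to the span that earned it.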
Component 4, the workflow for improvement
A six-step loop converts captured traces into improved behavior:
- Enable tracing and capture full sessions, including failed ones.
- Deploy on real tasks against a representative population.
- Manually review and tag a sample of traces; surface failure clusters.
- Define or refine evaluation dimensions from the observed clusters.
- Iterate on prompts, retrieval, and tool definitions; gate changes on per-dimension floors.
- Scale by automating the evaluator suite over the sampled trace stream.
The loop runs continuously, not once. The dataset grows as production surfaces new failure patterns; the dimensions are refined as the team learns which collapsed scores were hiding signal.
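Gating on per-dimension floors, as in the fifth step, can be sketched as follows. The dimension names come from the objectives listed earlier; the floor values and the candidate scores are made up for illustration.

```python
def gate(scores: dict[str, float], floors: dict[str, float]) -> list[str]:
    """Return the dimensions below their floor; an empty list means pass."""
    failures = []
    for dimension, floor in floors.items():
        value = scores.get(dimension)
        # A missing score counts as a failure rather than a silent pass.
        if value is None or value < floor:
            failures.append(dimension)
    return failures


# Hypothetical floors and candidate scores.
floors = {"faithfulness": 0.90, "tool_call_quality": 0.85, "goal_completion": 0.70}
candidate = {"faithfulness": 0.95, "tool_call_quality": 0.91, "goal_completion": 0.64}

failed = gate(candidate, floors)
average = sum(candidate.values()) / len(candidate)

print(failed)             # ['goal_completion']
print(round(average, 2))  # 0.83
```

The average of roughly 0.83 would look acceptable on its own; the per-dimension gate blocks the change because goal completion sits below its floor.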
Component 5, evaluating and optimizing the prompts themselves
Prompts are versioned artifacts. Each version is scored on the ground-truth dataset before promotion. Each version's per-dimension scores are recorded alongside the prompt itself, so a regression two versions later can be traced to the change that introduced it. Prompt optimization is gated by the same evaluators that gate model changes; this is what keeps prompt iteration disciplined rather than vibe-driven.
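One way to record per-dimension scores alongside each prompt version, and to locate the version that introduced a regression, is sketched below. `PromptVersion`, `first_regression`, the tolerance of 0.02, and all score values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    version: str
    text: str
    scores: dict[str, float]  # per-dimension scores on the ground-truth set


def first_regression(history: list[PromptVersion], dimension: str,
                     tolerance: float = 0.02) -> Optional[str]:
    """Return the first version whose score on `dimension` dropped by more
    than `tolerance` relative to the version before it, or None."""
    for previous, current in zip(history, history[1:]):
        if previous.scores[dimension] - current.scores[dimension] > tolerance:
            return current.version
    return None


history = [
    PromptVersion("v1", "...", {"faithfulness": 0.92, "goal_completion": 0.70}),
    PromptVersion("v2", "...", {"faithfulness": 0.93, "goal_completion": 0.71}),
    PromptVersion("v3", "...", {"faithfulness": 0.93, "goal_completion": 0.66}),
    PromptVersion("v4", "...", {"faithfulness": 0.92, "goal_completion": 0.65}),
]

print(first_regression(history, "goal_completion"))  # v3
print(first_regression(history, "faithfulness"))     # None
```

Here a drop noticed at v4 traces back to v3, the version where goal completion first fell by more than the tolerance.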
Example
A team running a B2B support agent stands up the combined practice over six weeks:
- Week 0. Per-turn evaluation looks healthy at 0.92 average. Goal-level completion is 64 percent. The team trusts the per-turn dashboard and does not understand the gap.
- Week 2. Tracing primitives in place: every session captured as a thread; every trace structured into typed spans. The team reviews 50 failing traces by hand and identifies three failure clusters.
- Week 3. Three new evaluation dimensions added at full-turn and multi-turn levels: plan-execution coherence, tool-response interpretation, goal-level completion. Judges calibrated against 80 expert-labeled sessions.
- Week 4. Evaluators wired into both CI/CD and a 10 percent production sample. The same evaluators run in both places. First regression caught at the gate.
- Week 6. Production sampling has surfaced 22 new failure cases; 12 enter the dataset after expert labeling. Goal-level completion rises to 73 percent. The per-turn dashboard barely moved.
The improvement was not from a better model. It was from a combined practice that surfaced the gap, named the dimensions, and converted production failures into regression cases.
Limitations
Caveats worth flagging up front:
- Trace schema is the rate-limiter. Under-instrumented traces tax every later step. Invest in the schema before the evaluators.
- Judge cost compounds. Scoring 10 percent of production traffic on five dimensions is real money. Pin smaller models on cheap dimensions; reserve frontier judges for the dimensions where they measurably outperform.
- Dataset growth is gated by expert labels. Goal-level labels are expensive. Prioritize labeling by cluster impact, not by recency.
- Per-turn dashboards lie. A healthy per-turn average is consistent with collapsing goal-level completion. Always evaluate at the level where the user experiences the outcome.
- Evaluator drift is a signal, not noise. Track judge agreement with human labels weekly, and recalibrate when the frontier models behind the judges or the agent change.
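Judge agreement with human labels can be tracked in more than one way. The sketch below uses Cohen's kappa, which corrects raw agreement for agreement expected by chance; this choice of statistic is an assumption for illustration, and the labels are invented.

```python
from collections import Counter


def cohen_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten sessions.
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

kappa = cohen_kappa(judge, human)
print(round(kappa, 3))  # 0.583
```

Raw agreement here is 8 of 10, while kappa is about 0.58. Logging this value each week against a fixed labeled set is one way to detect evaluator drift.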
Evidence and sources
- GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for standard span and attribute conventions onto which the run, trace, and thread levels can be mapped.
- Building Better AI Agents: Observability and Evaluation, https://www.youtube.com/watch?v=reISMhbZ2XE, for the combined-practice framing.
- Test and evaluate documentation, https://docs.anthropic.com/en/docs/test-and-evaluate/, for the versioned-rubric and judge-calibration discipline.
FAQ
Why not just observe and skip the evaluation layer? Observability alone produces dashboards that grow over time and trust that does not. Without scores, the team cannot tell whether a change made the system better or worse, so changes ship on hunches.
Why not just evaluate and skip the observability layer? Evaluation alone produces scores with nowhere to land. A failing score on a multi-turn thread is debuggable only if every span in the thread is captured with enough fidelity that the team can reproduce the failure.
At what level should I start, single-step, full-turn, or multi-turn? Start at the level where the user experiences the outcome. For a support agent, that is goal-level completion at the thread level. Add single-step and full-turn dimensions as you observe failure patterns that those levels miss.
How much production traffic should I sample for the eval layer? A 5 to 10 percent head-based sample is a sensible default. Bias upward on critical surfaces and downward on stable ones. Tail-based sampling for failures and boundary-biased sampling for novel inputs are complements to the head-based sample, not substitutes for it.
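One common way to implement a head-based sample at a fixed rate is to hash the trace identifier, so the keep-or-drop decision is deterministic for a given trace. The sketch below assumes string trace identifiers; `in_sample` and `should_evaluate` are hypothetical names, and always keeping failed traces stands in for the tail-based complement.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based decision: the same trace id always gets
    the same answer, regardless of which process asks."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def should_evaluate(trace_id: str, failed: bool, rate: float = 0.10) -> bool:
    # Failed traces are always kept, on top of the head-based sample.
    return failed or in_sample(trace_id, rate)


ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(in_sample(trace_id, 0.10) for trace_id in ids)
print(kept / len(ids))  # close to 0.10
```

Raising the rate for a critical surface or lowering it for a stable one is then a per-surface configuration value rather than a code change.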
How do I keep the trace schema and evaluators from drifting apart? Treat the schema as a versioned interface. Evaluators name the span types they consume; schema changes that break an evaluator are caught in CI before they ship.
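A CI check of this kind can be sketched as a comparison between the span types and fields the schema publishes and those each evaluator declares. The schema contents, evaluator names, and field names below are invented for illustration.

```python
def contract_violations(schema: dict[str, set[str]],
                        requirements: dict[str, dict[str, set[str]]]) -> list[str]:
    """List every span type or field an evaluator needs that the schema lacks."""
    problems = []
    for evaluator, needs in sorted(requirements.items()):
        for span_type, fields in sorted(needs.items()):
            if span_type not in schema:
                problems.append(f"{evaluator}: missing span type {span_type}")
                continue
            for name in sorted(fields - schema[span_type]):
                problems.append(f"{evaluator}: {span_type} lacks field {name}")
    return problems


# Hypothetical schema version and evaluator declarations.
SCHEMA_V2 = {
    "model_call": {"prompt", "completion"},
    "tool_call": {"name", "arguments", "response"},
}
REQUIREMENTS = {
    "faithfulness": {"model_call": {"completion"}, "retrieval": {"documents"}},
    "tool_call_quality": {"tool_call": {"arguments", "response", "latency_ms"}},
}

problems = contract_violations(SCHEMA_V2, REQUIREMENTS)
for line in problems:
    print(line)
# faithfulness: missing span type retrieval
# tool_call_quality: tool_call lacks field latency_ms
```

Failing the build whenever the list is non-empty catches a breaking schema change before it reaches production.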