How do you combine agent observability and evaluations?

Updated: 2026-01-27 By: Ari Heljakka

Short answer

Observability and evaluation are two halves of the same loop. Observability captures what the agent did (traces, spans, tool calls, side effects). Evaluation scores whether what it did was good (orthogonal 0 to 1 dimensions against a versioned ground-truth dataset). Each half is incomplete on its own. Observability without evaluation gives the team a search interface for incidents, while evaluation without observability gives a benchmark with no live signal. The integrated loop turns every trace into a candidate for scoring, every score into a candidate for alerting and gating, every confirmed failure into a labeled regression case, and every fix back into a new trace. This guide walks through the complete cycle and the trade-offs that decide whether the loop closes or leaks.

Key facts

  • Definition: Integrated agent observability and evaluation is the continuous loop that captures full trajectories, scores them against versioned dimensional rubrics, surfaces drift through dashboards and alerts, gates deployments on regression sets, and feeds confirmed failures back into the ground-truth dataset.
  • When to use: Any multi-turn, tool-using, or multi-agent system where regressions are expensive and where failure modes are silent at the single-call level.
  • Limitations: Setup cost compounds across instrumentation, dataset construction, judge calibration, gate wiring, and alert tuning; a layer skipped to save effort on day one becomes expensive by month six.
  • Example: A support agent runs trajectory capture in production, samples ten percent for trajectory-level scoring, gates merges on per-dimension floors over the regression set, alerts on per-dimension drift in production, and feeds every confirmed incident back into the ground-truth dataset.

Key takeaways

  • Observability and evaluation are interdependent. Each half degrades without the other.
  • The unit of integration is the trajectory: one tree of spans carries both the operational data (latency, cost, errors) and the evaluable signal (per-dimension scores).
  • Three evaluation surfaces use the same evaluators: offline (pre-deploy against a regression set), online (continuous sampled scoring in production), ad hoc (incident-driven on a specific cluster). Consistency across the three is what makes any one of them trustworthy.
  • Evaluators are versioned managed components. Judge agreement against ground truth is the meta-metric that decides whether the loop is calibrated.
  • The feedback loop closes when every confirmed production failure becomes a labeled regression case before the next deploy.

Definition

Agent observability is the practice of capturing structured runtime data from an autonomous LLM-driven system: the user goal, every model call, every tool call (with structured request and structured response), planner outputs, retries, side effects. Observability emits traces (full trajectories), composed of spans (individual operations), grouped into threads (multi-turn sessions). The data model is OpenTelemetry-shaped in practice; the unit is the trajectory tree.

Agent evaluation is the practice of scoring those trajectories against orthogonal quality dimensions (goal completion, tool-use correctness, context retention, reasoning quality, multi-agent handoff fidelity, safety), each normalised to 0 to 1, calibrated against a versioned ground-truth dataset. Evaluators are managed components with their own versions, calibration history, and agreement metrics against human labels.

The integrated loop is what turns the two halves into a system: trajectories from observability become inputs to evaluation; evaluation scores become inputs to dashboards, alerts, and deployment gates; gates and alerts produce confirmed failures that become inputs to dataset growth; dataset growth tightens the evaluator; the tightened evaluator scores the next round of trajectories. The loop has no head and no tail; each turn raises the floor.

When this matters

The integrated loop matters when any of these holds:

  • Silent failures. When the agent's final response can look fluent and on-topic while the user's goal goes unmet, only trajectory-level scoring catches the gap.
  • Multi-turn state. Context-loss failures are invisible at the single-call level; they require trajectory scoring against context-retention rubrics.
  • Tool calls. Tool misuse is the dominant agent failure mode; it requires structured request and response capture (observability) plus a tool-use correctness evaluator (evaluation), and gates on both.
  • Continuous deployment. When prompts, models, or tools change often, every change must clear a regression gate scored on the same dimensions as production monitoring.
  • Audit and compliance. When an incident review must answer "which versioned rubric, judged by which evaluator version, against which dataset version, produced the score that flagged or gated this output," the lineage must be a first-class concern.

If your system is single-call with no tools and no multi-turn state, a lighter setup is fine. The complete loop is sized for the multi-turn, tool-using, goal-driven case.

How it works

The integrated loop has six components. Each is described as a property to aim for, not a tool that ships it.

Component 1, instrumented trajectories

Every session is captured as a tree of spans rooted on the user goal: planner outputs, model calls, tool calls (with structured request and structured response), retries, side effects. The data model is OpenTelemetry-shaped or equivalent. Free-text logs are not enough; the schema is what makes downstream scoring, clustering, and lineage possible.

What "good" looks like: structured tool-call capture (no opaque JSON-in-text), span attributes that follow the OTel GenAI semantic conventions, parent-child relationships preserved, side effects captured, sampling rate tunable per environment.
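
A minimal sketch of the span tree described above, using plain dataclasses and no telemetry SDK. The names `Span`, `build_tree`, and `tool_calls`, and the attribute keys used in the usage note, are hypothetical illustrations; they are not taken from the OTel GenAI semantic conventions, which a real deployment would follow for attribute naming.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One operation in a trajectory (goal, planner step, model call, tool call)."""
    span_id: str
    kind: str                               # hypothetical labels, e.g. "goal", "tool_call"
    parent_id: Optional[str] = None         # None marks the root (the user goal)
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> Span:
    """Link a flat list of spans into a tree and return its single root."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        if s.parent_id is None:
            roots.append(s)
        else:
            by_id[s.parent_id].children.append(s)
    if len(roots) != 1:
        raise ValueError("a trajectory needs exactly one root span")
    return roots[0]


def tool_calls(root: Span) -> List[Span]:
    """Collect every tool-call span under the root, depth first."""
    found = [root] if root.kind == "tool_call" else []
    for child in root.children:
        found.extend(tool_calls(child))
    return found
```

Keeping the request and response as structured values in `attributes` (for example `{"tool.name": "order_status", "tool.request": {"order_id": "A1"}}`, keys assumed for illustration) is what lets a downstream evaluator inspect tool use without parsing free text.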

Component 2, the dimension set

Quality is decomposed into orthogonal 0 to 1 dimensions chosen for the workload. A typical agent set: goal completion, tool-use correctness, context retention, reasoning quality, safety, format compliance. Composition into an aggregate score happens with explicit, documented weights so a regression on one dimension is actionable, not collapsed into a single number.

What "good" looks like: dimensions are independent (no double-counting), each is scored by one or more managed evaluators, the composition function is versioned alongside the dimensions, per-dimension thresholds are set before evaluation runs, not after.
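
One way to keep the aggregate and the per-dimension floors separate is sketched below. The function names and the normalised weighted mean are assumptions made for illustration; the text does not prescribe a particular composition function.

```python
from typing import Dict, List


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, normalised by the total weight."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[d] * weights[d] for d in scores) / total


def failed_floors(scores: Dict[str, float], floors: Dict[str, float]) -> List[str]:
    """Names of the dimensions whose score is below its floor."""
    return sorted(d for d, floor in floors.items() if scores[d] < floor)
```

Reporting `failed_floors` alongside the aggregate keeps a regression on a single dimension visible even when the weighted mean still looks healthy.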

Component 3, the ground-truth dataset

A versioned collection of labeled trajectories drawn from real production traces and augmented with adversarial cases. Each trajectory carries per-dimension labels. The dataset grows continuously: every confirmed production failure becomes a regression case.

What "good" looks like: dataset is versioned alongside code, labels are produced by a documented process (human review, expert review, or judge-assisted with human spot-check), refresh cadence is documented, the dataset is split into common, edge, and adversarial slices each gated independently.
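
The sketch below shows one possible shape for a labeled case and a content fingerprint that can be stored next to each score. `LabeledCase`, `dataset_fingerprint`, and `by_slice` are hypothetical names, and hashing the dataset contents is an assumption; a team that versions the dataset in the same repository as the code could use the commit hash instead.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

SLICES = ("common", "edge", "adversarial")


@dataclass(frozen=True)
class LabeledCase:
    case_id: str
    slice_name: str            # one of SLICES
    labels: Dict[str, float]   # per-dimension label, each in 0 to 1
    provenance: str            # e.g. "incident review", "red team"
    added_on: str              # ISO date the case entered the dataset


def dataset_fingerprint(cases: List[LabeledCase]) -> str:
    """Short content hash that does not depend on the order of the cases."""
    ordered = sorted(cases, key=lambda c: c.case_id)
    payload = json.dumps([asdict(c) for c in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def by_slice(cases: List[LabeledCase]) -> Dict[str, List[LabeledCase]]:
    """Group cases by slice so each slice can be gated independently."""
    grouped: Dict[str, List[LabeledCase]] = {name: [] for name in SLICES}
    for case in cases:
        if case.slice_name not in grouped:
            raise ValueError(f"unknown slice: {case.slice_name}")
        grouped[case.slice_name].append(case)
    return grouped
```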

Component 4, calibrated managed evaluators

Each evaluator is a managed component with its own version, prompt, model pin (where applicable), and agreement metric against the ground-truth dataset. Judge agreement is itself a tracked metric; drift below a threshold blocks deployment of the evaluator, not just of the system under evaluation.

What "good" looks like: every evaluator has a pinned model and a versioned prompt, agreement against ground truth is recomputed on a fixed cadence, evaluator changes go through the same gate as application changes, the system can attribute any score back to the exact evaluator version that produced it.
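
Judge agreement can be computed in several ways; the text does not fix one. The sketch below assumes the simplest variant: the fraction of ground-truth cases where the judge's score falls within a tolerance of the human label. The tolerance of 0.1 and the minimum agreement of 0.85 are illustrative defaults, and a chance-corrected statistic may be a better fit for some workloads.

```python
from typing import Dict


def judge_agreement(
    judge_scores: Dict[str, float],
    human_labels: Dict[str, float],
    tolerance: float = 0.1,   # assumed: how close counts as agreeing
) -> float:
    """Share of shared cases where judge and human label differ by at most `tolerance`."""
    shared = judge_scores.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no cases in common between judge scores and human labels")
    hits = sum(
        1 for case_id in shared
        if abs(judge_scores[case_id] - human_labels[case_id]) <= tolerance
    )
    return hits / len(shared)


def evaluator_deployable(agreement: float, min_agreement: float = 0.85) -> bool:
    """Gate on the evaluator itself, separate from the gate on the agent."""
    return agreement >= min_agreement
```

Running this per evaluator and per dimension on a fixed cadence gives the time series that separates evaluator drift from drift in the system under evaluation.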

Component 5, three evaluation surfaces sharing one evaluator panel

The same evaluators run in three places:

  • Offline. Pre-deploy, against the full regression set. Gates merges. Output is a per-dimension pass-fail.
  • Online. Continuously, against sampled production trajectories. Feeds dashboards and per-dimension drift alerts. Sampling rate tunes cost.
  • Ad hoc. Incident-driven, against a specific cluster of trajectories. Used during root-cause analysis.

Consistency across the three is what makes any one of them trustworthy. When the offline gate is green while online scores are drifting, the likely cause is that the dataset has drifted from the live distribution; it is not evidence that the gate is fine.
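
A sketch of what sharing one panel can look like in code: a single scoring function that the offline gate, the online sampler, and an ad hoc analysis all call. The panel is modelled as a mapping from dimension name to a callable; `score_batch` and `offline_gate` are hypothetical names, and real evaluators would wrap a pinned judge model, not a plain function.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List

Trajectory = Dict[str, object]
Evaluator = Callable[[Trajectory], float]


def score_batch(
    panel: Dict[str, Evaluator], trajectories: Iterable[Trajectory]
) -> Dict[str, float]:
    """Mean score per dimension; the same call serves all three surfaces."""
    batch = list(trajectories)
    if not batch:
        raise ValueError("cannot score an empty batch")
    return {dim: mean(evaluate(t) for t in batch) for dim, evaluate in panel.items()}


def offline_gate(
    panel: Dict[str, Evaluator],
    slices: Dict[str, List[Trajectory]],
    floors: Dict[str, float],
) -> Dict[str, List[str]]:
    """Failing dimensions per dataset slice; an empty result means the gate passes."""
    failures: Dict[str, List[str]] = {}
    for slice_name, cases in slices.items():
        scores = score_batch(panel, cases)
        failing = sorted(d for d, floor in floors.items() if scores[d] < floor)
        if failing:
            failures[slice_name] = failing
    return failures
```

Online monitoring would pass sampled production trajectories to the same `score_batch`, and an incident analysis would pass the trajectories of one cluster, so any difference in scores reflects the data and not a difference in evaluators.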

Component 6, the production-to-evaluation feedback loop

Every confirmed production failure (from an alert, an incident review, or user feedback) becomes a labeled regression case in the ground-truth dataset before the next deploy. Every reproduced failure tightens the gate that should have caught it. This loop is the only thing that prevents dataset drift from silently degrading the entire system over time.

What "good" looks like: failure-to-regression-case is a documented workflow with an SLA, the dataset includes the date and provenance of every case, periodic audits remove stale cases, the regression set grows but is not allowed to bloat past what CI can run in a reasonable budget.
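
The SLA part of that workflow can be checked mechanically. The sketch below assumes a hypothetical `ConfirmedFailure` record and an illustrative seven-day default; it lists the confirmed failures that have not yet been turned into a regression case within the SLA.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ConfirmedFailure:
    failure_id: str
    confirmed_on: date
    source: str                               # "alert", "incident_review", or "user_feedback"
    regression_case_id: Optional[str] = None  # set once the case is in the dataset


def overdue(
    failures: List[ConfirmedFailure], today: date, sla_days: int = 7
) -> List[str]:
    """IDs of failures with no regression case after more than `sla_days` days."""
    limit = timedelta(days=sla_days)
    return [
        f.failure_id
        for f in failures
        if f.regression_case_id is None and today - f.confirmed_on > limit
    ]
```

A non-empty result from a check like this could block the next deploy, which is one way to enforce "before the next deploy" without relying on memory.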

Example

A multi-turn support agent. The complete loop in operation:

  • Instrumentation. Every session emits an OTel-shaped trace with structured tool calls (CRM lookup, order status, returns API). Sampling: 100 percent in staging, 10 percent in production.
  • Dimensions. Six orthogonal axes: goal completion, tool-use correctness, context retention, reasoning quality, safety, format compliance. Composition into an aggregate uses documented weights; per-dimension floors are 0.85 for safety, 0.80 for goal completion, 0.90 for tool-use, 0.75 for the rest.
  • Ground truth. 112 labeled trajectories, split 70 common, 25 edge, 17 adversarial. Refreshed monthly; every confirmed production failure becomes a regression case within seven days.
  • Evaluators. Six managed evaluators, one per dimension. Each is a pinned-model LLM-as-judge with a versioned prompt. Judge agreement against the ground truth is recomputed weekly; drift below 0.85 agreement on any dimension blocks evaluator deployment.
  • Offline. Every merge triggers the regression run. A merge is blocked if any per-dimension floor fails on any slice.
  • Online. Sampled production trajectories are scored continuously. Per-dimension drift alerts fire when the 24-hour moving average drops more than 0.05 below the trailing-week baseline.
  • Ad hoc. When the failure-clustering layer surfaces a new cluster, an analyst runs the evaluators against the cluster's trajectories with finer-grained drill-down.
  • Feedback loop. Last quarter, 47 confirmed production failures were converted into regression cases; the agent's overall goal-completion score moved from 0.62 to the high 0.70s over eight weeks. The same evaluator panel produced every score along the way; the rubric versions and dataset versions for each weekly snapshot are queryable from the platform.
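
The online drift rule in this example can be written as a small check. The sketch assumes the caller has already collected one dimension's scores for the last 24 hours and for the trailing week; `drift_alert` is a hypothetical name, and the 0.05 default mirrors the illustrative figure above.

```python
from statistics import mean
from typing import Sequence


def drift_alert(
    last_24h: Sequence[float],
    trailing_week: Sequence[float],
    max_drop: float = 0.05,
) -> bool:
    """True when the 24-hour mean is more than `max_drop` below the weekly baseline."""
    if not last_24h or not trailing_week:
        raise ValueError("both windows need at least one score")
    return mean(trailing_week) - mean(last_24h) > max_drop
```

The check is one-sided on purpose: an improvement over the baseline does not fire an alert, only a drop does.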

Limitations

  • Setup cost is real. Six components, each with its own learning curve and operational ownership. Teams that skip components discover at month six which component they skipped.
  • Judge cost compounds. Trajectory-level scoring on a frontier judge is expensive at production scale; sampling, tiered evaluation (cheap deterministic checks first), and per-tier judges are operationally mandatory.
  • Datasets rot. A ground-truth dataset built six months ago against a different traffic distribution gives confidently wrong scores. The refresh discipline is what keeps the loop honest.
  • Evaluator drift looks identical to system drift. When agreement drops, the evaluator is the suspect first, not the agent. Without judge-agreement tracking, root cause is ambiguous.
  • Per-dimension thresholds can fight each other. Tightening safety can lower helpfulness; tightening goal completion can raise tool calls. The composition function and the thresholds are themselves an evolving design.
  • Multi-agent attribution is hard. Even with per-edge scoring, complex agent graphs produce cases where attribution is irreducibly ambiguous. Plan for it; do not pretend it is solved.
  • Alert fatigue. Per-dimension drift alerts can fire often if thresholds are too tight. Tune on rate of change, not absolute values, and treat alert tuning as part of the operational ownership of the loop.
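
The tiered evaluation mentioned under judge cost can be sketched as follows: deterministic checks run first, and the judge is called only when all of them pass. The names are hypothetical, and scoring a deterministic failure as 0.0 is an assumption; some dimensions may need a softer penalty.

```python
from typing import Callable, Dict, List, Optional

Trajectory = Dict[str, object]
Check = Callable[[Trajectory], Optional[str]]   # returns a failure reason, or None on pass
Judge = Callable[[Trajectory], float]


def has_final_response(trajectory: Trajectory) -> Optional[str]:
    """Example deterministic check: the agent must have produced a final reply."""
    return None if trajectory.get("final_response") else "missing final response"


def tiered_score(
    trajectory: Trajectory, cheap_checks: List[Check], judge: Judge
) -> Dict[str, object]:
    """Run cheap checks first; pay for the judge only if every check passes."""
    for check in cheap_checks:
        reason = check(trajectory)
        if reason is not None:
            return {"score": 0.0, "tier": "deterministic", "reason": reason}
    return {"score": judge(trajectory), "tier": "judge", "reason": None}
```

Recording the tier with each score keeps the lineage clear about whether a low score came from a deterministic rule or from the judge.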

Evidence and sources

Numeric figures in this post (sampling rates, threshold values, dataset sizes) are illustrative; calibrate against your own workload before using them in planning.

FAQ

Why not just observability, with evaluation as a phase later? Because the failure modes that matter at production scale (silent tool misuse, context loss, goal drift) are invisible at the observation layer; you can search the logs and never find them. Evaluation is what turns trajectories into scored deviations against an objective.

Why not just evaluation, against a static benchmark? Because static benchmarks drift away from the live distribution. Without observability feeding the dataset, the regression set ages out and the gate starts shipping regressions it would have caught a quarter ago.

Can the same evaluators run online and offline? Yes; this is the design goal. Offline-online evaluator parity is what makes a green offline gate trustworthy. Divergence between the two is a signal the dataset has drifted from production, not that the gate is fine.

How do I size the sampling rate in production? Start with what your judge budget allows. Stratify the sample so each dimension gets enough scored trajectories per day for the drift detector to be sensitive. Raise the rate on high-risk surfaces; lower it on stable ones.
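
As a sketch of per-surface sampling, the function below hashes the trace ID into a value in the range 0 to 1 and compares it with a rate chosen per surface. The surface names and rates are hypothetical, and hash-based selection is an assumption; its advantage is that the decision for a given trace is reproducible.

```python
import hashlib
from typing import Dict


def sample_fraction(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_score(
    trace_id: str,
    surface: str,
    rates: Dict[str, float],
    default_rate: float = 0.10,
) -> bool:
    """Decide whether a trace is scored, using the rate for its surface."""
    return sample_fraction(trace_id) < rates.get(surface, default_rate)
```

With rates such as `{"refunds": 0.5, "order_status": 0.05}` (illustrative values), a high-risk surface is scored more densely than a stable one while the total stays inside the judge budget.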

How often should the ground-truth dataset be refreshed? Continuously through the feedback loop (every confirmed failure becomes a case within days), plus a scheduled audit (monthly or quarterly) to retire stale cases and re-balance the slices.

What is the meta-metric that says the loop itself is healthy? Judge agreement against the ground-truth dataset, tracked per evaluator per dimension over time. When agreement drops, the loop is degraded even if the system under evaluation looks fine.
