How to Use Production Traces to Make AI Evaluations

Updated: 2026-02-23 By: Ari Heljakka

Short answer

The most reliable way to generate AI evaluations is to harvest them from production traffic. Capture full traces, surface anomaly-rich examples, label a few dozen per failure mode, and convert those labels into versioned evaluators (rule-based where possible, LLM-as-judge where required) that gate every subsequent deployment. The output is not a static benchmark; it is a continuously updated set of managed evaluators tied to ground-truth labels drawn from the same distribution your users actually create.

Key facts

  • Definition: A four-stage loop (observe, annotate, generate, iterate) that turns production traces into versioned, calibrated evaluators.
  • When to use: Any AI system with real users and recurring failure patterns where synthetic benchmarks fail to predict live behavior.
  • Limitations: Requires trace instrumentation, an annotation discipline, and ongoing calibration against human labels. Cold-start systems with no production traffic need synthetic bootstrap data first.
  • Example: A support agent surfaces 100 anomalous traces per week; experts label 15 examples per recurring failure type; rule and judge evaluators are generated and gated in CI, with Matthews correlation tracked over time.

Key takeaways

  • Production data is the calibration substrate. Static benchmarks miss the long tail; live traffic does not.
  • Treat trace instrumentation as evaluation infrastructure, not as logging. Inputs, intermediate steps, tool calls, and outputs all carry signal worth capturing.
  • Annotate against anomalies, not random samples. Twenty hand-labeled failures per pattern are worth more than a thousand uniformly sampled traces.
  • Generate evaluators from labels. Each evaluator is a managed, versioned artifact tied to a specific failure mode and a specific calibration slice.
  • Measure evaluator quality continuously. Agreement with human labels, tracked as one number per evaluator (Matthews correlation for binary judgments), drives both calibration and trust.

Definition

Generating evaluations from production data means converting real user traces into versioned, calibrated checks that score every new release. The pipeline separates two concerns cleanly. The objective is the failure mode you want to prevent (for example, "the agent invents a tool result"). The implementation is the evaluator (a regex on tool output shape, a JSON schema check, an LLM judge with a rubric, or a small classifier). Multiple implementations can serve the same objective; the right one is the one that survives calibration against human labels.
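The split between objective and implementation can be sketched as a small data structure. This is a minimal illustration, not a prescribed design: the names Implementation, Objective, agreement, and best_implementation are hypothetical, and plain agreement rate stands in for a fuller calibration metric to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Implementation:
    """One candidate evaluator for an objective (hypothetical structure)."""
    name: str
    detects_failure: Callable[[str], bool]  # True means "failure detected"


@dataclass(frozen=True)
class Objective:
    """The failure mode to prevent, with the implementations competing to detect it."""
    failure_mode: str
    candidates: tuple[Implementation, ...]


def agreement(impl: Implementation, labeled: list[tuple[str, bool]]) -> float:
    """Fraction of human-labeled outputs where the implementation matches the label."""
    hits = sum(impl.detects_failure(output) == is_failure for output, is_failure in labeled)
    return hits / len(labeled)


def best_implementation(objective: Objective, labeled: list[tuple[str, bool]]) -> Implementation:
    """Keep the candidate that agrees best with human labels."""
    return max(objective.candidates, key=lambda c: agreement(c, labeled))
```

Because the objective stays fixed while implementations are swapped, a regex can be replaced by a judge later without changing what the evaluator is supposed to prevent.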

This is fundamentally different from running public benchmarks. Benchmarks ask whether your system is good at a generic task; production-derived evaluations ask whether your next release is at least as good as the current one on the exact distribution your users actually produce.

When this matters

Production-derived evaluations earn their cost in several common situations:

  • Long-tailed inputs. Customer support, sales triage, document extraction, and agent workflows generate inputs no benchmark designer anticipated.
  • High-stakes regressions. Even a 1 percent drop in correctness on a critical slice can outweigh aggregate improvements; only evaluators calibrated on that slice will catch it.
  • Frequent prompt or model changes. Every iteration introduces silent regressions; an evaluation suite tied to live traffic catches them before users do.
  • Agent systems with tool calls. Span-level traces reveal failures (wrong tool, malformed arguments, ignored result) that input-output evaluations miss entirely.

If your system runs at low volume with hand-curated inputs, a static suite is fine. Past a few thousand requests per day, the live distribution starts to dominate, and only a feedback loop tied to that distribution will catch what matters.

How it works

The loop has four stages. Each stage produces an artifact the next stage consumes.

Stage 1: Observe

Instrument every interaction. Capture inputs, outputs, intermediate reasoning, tool calls, conversation turns, and span-level timing. Use a standard tracing format (OpenTelemetry or compatible) so the data is portable. Tag spans with session identifiers, prompt version, model version, and any feature flags so traces can be sliced after the fact.

Two design decisions matter here. First, capture enough context that a reviewer can reconstruct what the model saw; this includes the system prompt, retrieved documents, and tool schemas. Second, store traces somewhere queryable; the annotation stage needs to filter by anomaly heuristics, not scroll through raw logs.
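As a rough sketch of what "enough context" might look like, the record below carries the fields named above. The class and field names are assumptions for illustration; a real deployment would more likely map these onto span attributes in its tracing backend rather than hand-rolled dataclasses.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step inside a trace: a model call, a tool call, or a retrieval."""
    name: str
    kind: str            # assumed vocabulary: "llm", "tool", "retrieval"
    input: str
    output: str
    duration_ms: float
    error: Optional[str] = None


@dataclass
class Trace:
    """Everything a reviewer needs to reconstruct what the model saw."""
    trace_id: str
    session_id: str
    prompt_version: str
    model_version: str
    system_prompt: str
    feature_flags: dict[str, bool] = field(default_factory=dict)
    retrieved_documents: list[str] = field(default_factory=list)
    tool_schemas: list[dict] = field(default_factory=list)
    spans: list[Span] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to one JSON document so traces can be stored and queried."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping prompt version, model version, and feature flags on the trace itself is what makes slicing after the fact possible without joining against deployment logs.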

Stage 2: Annotate

Random sampling wastes annotator time. Instead, prioritize traces by anomaly signals:

  • Context window pressure. Conversations that approach the model's input limit are over-represented in failures.
  • Token usage spikes. Long outputs often signal hallucinated reasoning or runaway loops.
  • User re-submission. A user who rephrases or retries the same request just revealed a failure.
  • Tool-call retries or errors. Repeated tool calls with adjusted arguments indicate the model could not parse the result.
  • Low confidence or guardrail trips. Existing rule-based filters surface near-misses for review.

A domain expert reviews the anomaly queue and labels each example with the failure mode it represents. Aim for ten to twenty labeled examples per failure mode before generating an evaluator. Comprehensive coverage is not the goal; representative coverage of each pattern is.
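The anomaly signals above can be combined into a simple priority score for the review queue. The sketch below is one possible shape, assuming per-trace statistics are already available; the TraceStats fields and the cutoffs (90 percent of the context limit, three times the median output length, two or more tool retries) are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class TraceStats:
    """Per-trace statistics; field names are hypothetical."""
    trace_id: str
    input_tokens: int
    context_limit: int
    output_tokens: int
    median_output_tokens: int
    user_resubmitted: bool
    tool_retries: int
    guardrail_tripped: bool


def anomaly_score(t: TraceStats) -> int:
    """Count how many anomaly signals fire for this trace."""
    signals = [
        t.input_tokens >= 0.9 * t.context_limit,        # context window pressure
        t.output_tokens >= 3 * t.median_output_tokens,  # token usage spike
        t.user_resubmitted,                             # user rephrased or retried
        t.tool_retries >= 2,                            # repeated tool calls
        t.guardrail_tripped,                            # existing filter fired
    ]
    return sum(signals)


def review_queue(traces: list[TraceStats], limit: int) -> list[TraceStats]:
    """Return flagged traces, most anomalous first, capped at `limit`."""
    flagged = [t for t in traces if anomaly_score(t) > 0]
    return sorted(flagged, key=anomaly_score, reverse=True)[:limit]
```

Counting signals equally is the simplest choice; weighting them by how often each one has led to a confirmed failure would be a natural refinement once labels exist.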

Treat the labeled set as a versioned ground-truth dataset. Every label is dated, attributed to a reviewer, and immutable. Re-labeling under a revised rubric produces a new dataset version, not an overwrite.
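One way to make the "dated, attributed, immutable" rule hard to violate is to use frozen records and treat re-labeling as a function that returns a new dataset version. The names below (Label, LabeledDataset, relabel) are hypothetical and the storage layer is left out entirely.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Label:
    """A single human judgment; frozen so it cannot be edited after creation."""
    trace_id: str
    failure_mode: str
    is_failure: bool
    reviewer: str
    labeled_on: date


@dataclass(frozen=True)
class LabeledDataset:
    """A versioned snapshot of labels produced under one rubric version."""
    version: int
    rubric_version: str
    labels: tuple[Label, ...]


def relabel(previous: LabeledDataset, rubric_version: str, labels: list[Label]) -> LabeledDataset:
    """Re-labeling yields a new dataset version; the previous one is left untouched."""
    return LabeledDataset(previous.version + 1, rubric_version, tuple(labels))
```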

Stage 3: Generate

Each labeled failure mode becomes one or more evaluators. There are two common implementations:

  • Rule-based evaluators. Regex, JSON schema validators, length and structure checks, banned-phrase lists, tool-call argument validation. Fast, deterministic, near-zero cost. Use these whenever the failure is structural.
  • LLM-as-judge evaluators. A judge prompt that takes the input, the output, and a rubric, and returns a normalized score (0 to 1) with a justification. Use these for open-ended failures (hallucination, tone, faithfulness, helpfulness).

Both kinds of evaluators are first-class versioned components: each one has an identifier, a rubric, a calibration set, and a current agreement score against human labels. Each evaluator outputs a normalized score so that independent dimensions can be composed into a scorecard, and tradeoffs can be reasoned about quantitatively.
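The two evaluator kinds can share one interface. The sketch below is an assumption-laden illustration: Evaluator, valid_tool_call, and make_judge are hypothetical names, the judge is expected to reply with JSON containing a "score" field, and call_model is a placeholder for whatever client function sends a prompt to the judge model and returns its text reply.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

ScoreFn = Callable[[str, str], float]  # (input, output) -> score


@dataclass(frozen=True)
class Evaluator:
    """A versioned evaluator with its rubric and latest agreement score."""
    evaluator_id: str
    version: int
    rubric: str
    score_fn: ScoreFn
    agreement: Optional[float] = None  # filled in after calibration against human labels

    def score(self, input_text: str, output_text: str) -> float:
        """Run the evaluator and clamp the result to the 0 to 1 range."""
        return min(1.0, max(0.0, self.score_fn(input_text, output_text)))


def valid_tool_call(_input_text: str, output_text: str) -> float:
    """Rule-based check: output must be a JSON object with 'tool' and 'arguments'."""
    try:
        payload = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    ok = (isinstance(payload, dict)
          and isinstance(payload.get("tool"), str)
          and isinstance(payload.get("arguments"), dict))
    return 1.0 if ok else 0.0


def make_judge(call_model: Callable[[str], str], rubric: str) -> ScoreFn:
    """Build a judge score function around a caller-supplied model client."""
    def judge(input_text: str, output_text: str) -> float:
        prompt = (f"Rubric: {rubric}\nInput: {input_text}\nOutput: {output_text}\n"
                  'Reply as JSON: {"score": <0 to 1>, "justification": "..."}')
        try:
            return float(json.loads(call_model(prompt))["score"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            return 0.0  # assumption: an unparseable reply counts as a failing score
    return judge
```

Scoring an unparseable judge reply as 0.0 is a conservative choice; surfacing it as a separate error state may be preferable if judge failures are common.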

A useful default is to compose two or three evaluators per failure mode: a cheap deterministic check for the structural part, a judge for the semantic part, and where stakes warrant it, a human sampling rate for ongoing calibration.
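A composition along these lines might look like the sketch below. The function names are hypothetical, and the hash-based sampling is just one simple way to pick a stable subset of traces for human review; any deterministic sampler would serve.

```python
import hashlib
from typing import Callable

ScoreFn = Callable[[str, str], float]  # (input, output) -> score in 0..1


def scorecard(evaluators: dict[str, ScoreFn], input_text: str, output_text: str) -> dict[str, float]:
    """Run every evaluator for a failure mode and keep the scores side by side."""
    return {name: fn(input_text, output_text) for name, fn in evaluators.items()}


def weakest_dimension(card: dict[str, float]) -> str:
    """Name of the lowest-scoring dimension, useful when triaging a failure."""
    return min(card, key=card.get)


def sampled_for_human_review(trace_id: str, sample_rate: float) -> bool:
    """Deterministically route a fixed fraction of traces to human reviewers."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Hashing the trace identifier, rather than drawing a random number, means the same trace always gets the same decision, which keeps reruns reproducible.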

Stage 4: Iterate

Evaluator quality is itself a measurable signal. Track it.

A widely used metric for evaluator-vs-human agreement on binary classification tasks is the Matthews Correlation Coefficient (MCC), which ranges from -1 to 1 and is robust to class imbalance. A working rule of thumb:

  • MCC above 0.6: the evaluator is reliable enough to gate deployments.
  • MCC between 0.4 and 0.6: useful as a signal, not as a hard gate; recalibrate or split into sub-evaluators.
  • MCC below 0.4: the evaluator is not measuring what the rubric claims; revise the rubric or the prompt.
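For binary labels, MCC can be computed directly from the confusion matrix as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements that formula and maps the result onto the rule of thumb above; the function names and the status strings are illustrative, and returning 0.0 when the denominator is zero is a common convention rather than part of the definition.

```python
import math


def matthews_corrcoef(human: list[bool], evaluator: list[bool]) -> float:
    """MCC between human labels and evaluator verdicts (True = failure)."""
    pairs = list(zip(human, evaluator))
    tp = sum(h and e for h, e in pairs)
    tn = sum((not h) and (not e) for h, e in pairs)
    fp = sum((not h) and e for h, e in pairs)
    fn = sum(h and (not e) for h, e in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def gate_status(mcc: float) -> str:
    """Map an MCC value onto the rule of thumb described above."""
    if mcc > 0.6:
        return "gate"      # reliable enough to block deployments
    if mcc >= 0.4:
        return "signal"    # informative, but not a hard gate
    return "revise"        # rubric or prompt needs rework
```

With only ten to twenty labels per failure mode, a single disagreement moves MCC noticeably, so it is worth reading the value alongside the raw confusion counts.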

For graded (non-binary) judges, use rank correlation against human scores on the same examples and watch for drift as model versions change.
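Spearman's rank correlation is one common choice for this. The sketch below computes it from scratch, as the Pearson correlation of average ranks, so that tied scores are handled; the function names are illustrative, and a statistics library would normally be used instead.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(human: list[float], judge: list[float]) -> float:
    """Rank correlation between human scores and judge scores on the same examples."""
    rh, rj = average_ranks(human), average_ranks(judge)
    n = len(rh)
    mean_h, mean_j = sum(rh) / n, sum(rj) / n
    cov = sum((a - mean_h) * (b - mean_j) for a, b in zip(rh, rj))
    var_h = sum((a - mean_h) ** 2 for a in rh)
    var_j = sum((b - mean_j) ** 2 for b in rj)
    if var_h == 0 or var_j == 0:
        return 0.0  # convention: no variation means no measurable correlation
    return cov / math.sqrt(var_h * var_j)
```

Recomputing this value on the same labeled examples after each judge or model update gives a direct drift reading.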

Wire the evaluators into CI. Every prompt change, model swap, or judge update runs the full suite against the labeled dataset. A regression below the threshold blocks the deployment. Prune evaluators whose failure mode has been structurally fixed; add new evaluators as new failure modes emerge from the anomaly queue.
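A CI gate of this kind can be as small as a script that compares the candidate's scores with the current baseline and exits non-zero on regression. The sketch below assumes scores have already been computed per evaluator; the function names and the 0.02 tolerance are illustrative assumptions.

```python
from typing import Callable

ScoreFn = Callable[[str, str], float]  # (input, output) -> score in 0..1


def pass_rate(score_fn: ScoreFn, examples: list[tuple[str, str]], threshold: float = 0.5) -> float:
    """Fraction of labeled examples whose score meets the threshold."""
    passed = sum(score_fn(inp, out) >= threshold for inp, out in examples)
    return passed / len(examples)


def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, tuple[float, float]]:
    """Evaluators whose candidate score fell more than `tolerance` below baseline."""
    regressions = {}
    for name, before in baseline.items():
        after = candidate.get(name, 0.0)  # a missing evaluator counts as a failure
        if after < before - tolerance:
            regressions[name] = (before, after)
    return regressions


def ci_exit_code(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Return 1 (block the deployment) if any evaluator regressed, else 0."""
    regressions = find_regressions(baseline, candidate)
    for name, (before, after) in regressions.items():
        print(f"REGRESSION {name}: {before:.2f} -> {after:.2f}")
    return 1 if regressions else 0
```

In a pipeline, the return value would typically be passed to sys.exit so that a non-zero code fails the build step.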

This stage closes the loop. New production traffic surfaces new anomalies, which become new labels, which become new evaluators, which gate new releases.

Example

A team running a research assistant for analysts ingests roughly 8,000 traces per day. They wire an anomaly queue that surfaces about 120 traces per week prioritized by context pressure, tool retries, and user re-submissions.

Two analysts spend two hours per week labeling. After four weeks they have:

  • 18 examples of "agent invents a citation that does not exist in the retrieved documents"
  • 22 examples of "agent calls the wrong tool for the user's question"
  • 14 examples of "agent ignores the tool result and answers from prior knowledge"
  • 11 examples of "agent produces a response that exceeds the response budget without summarizing"

From these labels they generate:

  • A deterministic check that every citation in the output must appear verbatim in the retrieved documents (catches the first failure mode at near-zero cost).
  • A judge that scores tool-call appropriateness given the user's question against a one-paragraph rubric (catches the second mode; calibrated against the 22 labels with MCC of 0.71).
  • A judge that scores whether the answer is grounded in the tool result (catches the third mode; MCC 0.64 after one rubric revision).
  • A length check plus a summary-style check for the fourth mode.
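The deterministic citation check in the first item could be sketched as follows. The citation format is an assumption made for illustration: quoted passages are taken to appear in the output wrapped in double square brackets. Whitespace is collapsed before matching, which is a slight relaxation of strictly verbatim comparison.

```python
import re

# Assumed citation markup: [[quoted passage]]
CITATION = re.compile(r"\[\[(.+?)\]\]", re.DOTALL)


def collapse_whitespace(text: str) -> str:
    """Normalize runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def citation_score(output: str, documents: list[str]) -> float:
    """Fraction of cited passages that appear in at least one retrieved document."""
    citations = CITATION.findall(output)
    if not citations:
        return 1.0  # assumption: nothing cited means nothing invented
    corpus = [collapse_whitespace(doc) for doc in documents]
    grounded = sum(
        any(collapse_whitespace(passage) in doc for doc in corpus)
        for passage in citations
    )
    return grounded / len(citations)
```

Whether an answer with no citations should pass is a policy decision; a team that requires citations would pair this with a separate presence check.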

The suite runs in CI on every change. A candidate release that swaps the underlying model fails the citation-grounding check (regression from 0.92 to 0.78 on the labeled slice). The gate blocks the deployment; the team holds the release, investigates, and identifies a prompt template the new model interprets differently. The fix ships a week later, with the evaluators unchanged.

Limitations

  • Cold-start systems have no traffic. Bootstrap with synthetic data or a small pilot, but switch to production-derived evaluation as soon as real traffic exists.
  • Annotation discipline is a fixed cost. Two hours per week is sustainable; eight hours per week is not. Scope failure modes accordingly.
  • Judges drift. When the underlying judge model is updated, recalibrate against the labeled set. Track agreement as a metric in its own right, with alerts on drops.
  • Privacy and consent. Production traces contain user data. Sample with consent, redact PII before annotation, and apply the same access controls to the labeled dataset that you apply to production data.
  • Evaluator overfitting. Iterating prompts against a static labeled set eventually fits the set more than the task. Refresh the labels from new production samples on a cadence; treat the delta between fresh and historical labels as a drift signal.

FAQ

How many labeled examples do I need per failure mode? Ten to twenty is usually enough to bootstrap a useful evaluator. The point is not statistical power; it is sufficient calibration signal to measure whether the evaluator agrees with human judgment. Add more labels when agreement is borderline or when the failure pattern is heterogeneous.

What if my production volume is too low for anomaly sampling? Use uniform sampling instead, but accept that labeled coverage will be sparse. As traffic grows, switch to anomaly-prioritized sampling. In the meantime, synthetic adversarial inputs can fill the gap for known categories (prompt injection, malformed inputs, multilingual variants).

Should I prefer rule-based evaluators or LLM judges? Use rule-based whenever the failure is structural (schema, format, presence of a required field). Use a judge whenever the failure is semantic (faithfulness, tone, helpfulness). Most failure modes need both: a cheap deterministic guard for the obvious failures and a judge for the rest.

How do I know when an evaluator is good enough to gate deployments? When it agrees with human labels on a held-out calibration slice. For binary classification, a Matthews Correlation Coefficient above 0.6 is a defensible threshold. For graded scores, watch rank correlation and stability across model versions. Below those bars, treat the evaluator as a signal, not a gate.

How often should I refresh the labeled dataset? On a weekly or monthly cadence, depending on traffic volume and rate of distribution change. Whenever a fresh production sample scores meaningfully differently from the historical labels, that delta is itself the signal: either user behavior has shifted, or the rubric needs revision.
