How do you monitor AI agents in production?

Updated: 2026-02-24 By: Ari Heljakka

Short answer

Monitoring an AI agent in production is monitoring a non-deterministic, multi-step, tool-using system whose failures often look like successes. The practical loop has five moves: capture every step as a structured trace; score a sampled fraction with versioned evaluators tied to named objectives; set SLOs per objective and per failure mode, not per request; alert on score drift and on rare-event clusters, not on absolute thresholds; route every production failure back into the calibration dataset so the next deploy is gated against it. Done well, the agent stops surprising the on-call rotation. Done poorly, the dashboard is green while users complain.

Key facts

  • Definition: Production monitoring for agents is the structured capture, evaluation, alerting, and feedback loop that keeps a non-deterministic LLM system within stated objectives. It is observability plus continuous evaluation plus an on-call workflow.
  • When to use: Any production agent with chained prompts, retrieval, tool use, or multi-turn state, where wrong answers are more expensive than slow answers.
  • Limitations: Monitoring tells you what is happening; it does not fix bad rubrics, weak calibration, or an underspecified objective. The loop is only as reliable as its judges.
  • Example: A support agent emits a trace per session with spans for each tool call and LLM turn. A sampled evaluator scores faithfulness, refund-policy adherence, and tone. Drift on any dimension alerts on-call, and any flagged trace is added to the calibration dataset before the next release.

Key takeaways

  • Per-request error rates miss most agent failures. The failure modes are silent tool failures, corrupted state, and goal-level failures that look like successful turns.
  • SLOs belong on objectives (faithfulness, instruction following, refund policy), not just on system metrics (latency, cost, error rate).
  • Statistical baselines beat fixed thresholds. The right alert is "the distribution of score X moved more than its rolling baseline," not "score X is below 0.7."
  • Tail-based sampling driven by evaluator scores is the cheapest way to keep audit-quality traces of the things that matter.
  • Every production failure that does not become a pre-deployment evaluation case is a regression waiting to happen again.

Definition

Production monitoring for AI agents is the operational practice of continuously capturing what an agent did, scoring it against versioned objectives, surfacing drift as actionable signals, and routing failures back into the evaluation set that gates the next release. It extends classical observability with three additions: traces designed for non-deterministic, multi-step runs; managed evaluators that score dimensions of quality alongside system metrics; and a feedback loop where flagged production traces become pre-deployment calibration data.

The unit of capture is a session or run, not a single LLM call. The unit of evaluation is a scored sample against a versioned objective. The unit of alerting is a distribution shift on a named dimension. The unit of remediation is a new evaluation case in the calibration set.

When this matters

Agent monitoring becomes decisive when one or more of these is true:

  • The agent uses tools. Tools fail silently: an expired token, a schema drift, a stale endpoint. The LLM keeps generating output; the user sees a confidently wrong answer.
  • The agent is multi-turn. A bad early turn corrupts later context. A failure at step 3 may not surface until step 8. Session-level traces and session-level scoring are the only way to see this.
  • Outputs are user-facing or decision-critical. Refund advice, medical triage, code suggestions: each has an objective beyond "did the model respond." Per-objective monitoring is the only way to tell whether the agent did its job.
  • The agent is non-deterministic. Two runs on the same input return different outputs. A fixed threshold on any single response is not a meaningful signal; a baseline on the distribution is.
  • Failures are rare and expensive. A 0.5 percent rate of off-policy refunds is invisible in a dashboard of averages and very visible on the regulator's desk. Rare-event monitoring needs cluster detection, not threshold alerting.

If none holds, request-level metrics may be enough. If any does, agent-specific monitoring is the difference between a quiet on-call rotation and a series of surprises.

How it works

Step 1: instrument session-level traces

Each user-facing session emits a single trace. Inside the trace, spans correspond to each step the agent took: retrieval, planner LLM call, tool invocation, responder LLM call, guardrail check. Attributes capture model, prompt, retrieved documents, tool inputs and outputs, latency, tokens, and cost. Trace IDs stitch sessions to upstream and downstream events.

OpenTelemetry GenAI semantic conventions provide a vendor-neutral schema. Using them keeps the trace store decoupled from any single platform.
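As a rough illustration of the trace shape described in this step, the sketch below models one trace per session with one span per agent step. It uses plain dataclasses rather than an OpenTelemetry SDK so that it stays self-contained; the class names (`Span`, `SessionTrace`), the attribute keys (`tokens`, `cost_usd`, `documents`), and the model name are all hypothetical and are not taken from the GenAI semantic conventions.

```python
from dataclasses import dataclass, field
from typing import Any
import uuid


@dataclass
class Span:
    """One step the agent took inside a session."""
    name: str
    start_s: float
    end_s: float
    attributes: dict[str, Any] = field(default_factory=dict)

    @property
    def latency_ms(self) -> float:
        return (self.end_s - self.start_s) * 1000.0


@dataclass
class SessionTrace:
    """One trace per user-facing session; spans are appended in order."""
    session_id: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, name: str, start_s: float, end_s: float, **attributes: Any) -> Span:
        span = Span(name, start_s, end_s, dict(attributes))
        self.spans.append(span)
        return span

    def total(self, attribute: str) -> float:
        """Sum a numeric attribute (tokens, cost) across all spans that carry it."""
        return sum(s.attributes.get(attribute, 0) for s in self.spans)


# Hypothetical session: retrieval, planner, one tool call, responder.
trace = SessionTrace(session_id="session-123")
trace.record("retrieval", 0.00, 0.12, documents=["kb-17", "kb-42"])
trace.record("planner_llm", 0.12, 0.90, model="example-model", tokens=876, cost_usd=0.004)
trace.record("tool:billing_lookup", 0.90, 1.10, tool_input={"account": "A-1"}, success=True)
trace.record("responder_llm", 1.10, 2.00, model="example-model", tokens=420, cost_usd=0.002)
print(trace.trace_id, [s.name for s in trace.spans], trace.total("tokens"))
```

In a real deployment the same structure would normally be emitted through a tracing SDK and exporter; the point here is only that the session, not the individual LLM call, owns the trace ID.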

Step 2: capture tool use and decision points

Tools are the most common silent failure. Every tool call records the request, the raw response, the parsed response, the success flag, and the latency. Decision points (router LLM picks a tool, planner decides to retry, guardrail rejects an output) are spans, not log lines. The trace must answer "what did the agent decide, and based on what."
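One way to make tool failures visible is to wrap every tool call so that the five fields named above are always recorded, even when the call or the parse fails. The sketch below assumes tools return JSON strings; `ToolCallRecord`, `call_tool`, and the two fake tools are hypothetical names used for illustration only.

```python
import json
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCallRecord:
    tool: str
    request: dict[str, Any]
    raw_response: Optional[str]
    parsed_response: Optional[Any]
    success: bool
    error: Optional[str]
    latency_ms: float


def call_tool(tool: str, fn: Callable[[dict[str, Any]], str],
              request: dict[str, Any]) -> ToolCallRecord:
    """Invoke a tool that returns a JSON string and record the full outcome."""
    start = time.perf_counter()
    raw: Optional[str] = None
    parsed: Optional[Any] = None
    error: Optional[str] = None
    try:
        raw = fn(request)
        parsed = json.loads(raw)
    except Exception as exc:  # transport errors and parse errors both count as failures
        error = f"{type(exc).__name__}: {exc}"
    latency_ms = (time.perf_counter() - start) * 1000.0
    return ToolCallRecord(tool, request, raw, parsed, error is None, error, latency_ms)


def healthy_billing_lookup(request: dict[str, Any]) -> str:
    return json.dumps({"account": request["account"], "balance": 42.0})


def expired_token_lookup(request: dict[str, Any]) -> str:
    # Simulates an endpoint answering with an HTML error page instead of JSON.
    return "<html>401 Unauthorized</html>"


ok = call_tool("billing_lookup", healthy_billing_lookup, {"account": "A-1"})
bad = call_tool("billing_lookup", expired_token_lookup, {"account": "A-1"})
print(ok.success, bad.success, bad.error)
```

Keeping the raw response next to the parsed one is what lets a reviewer see that the tool answered with an error page while the agent carried on as if it had data.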

Step 3: run evaluators on a sampled fraction

A managed evaluator catalogue holds one versioned judge per named objective. Each judge has a pinned model, a pinned prompt, a threshold, and a calibration dataset. A sample of production traces flows into the evaluator pipeline; scores are written back to the trace as first-class attributes with explicit lineage (rubric version, judge version, dataset version).

The dimensions to score are objective-specific. For a support agent, the dimensions often include:

  • Faithfulness to retrieved context.
  • Policy adherence (refund, escalation, disclosure).
  • Tone appropriate to the channel.
  • Tool use correctness (right tool, right inputs).
  • Goal completion (did the session resolve the user's intent).

Each dimension is scored independently and composed into a session-level scorecard.
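A minimal sketch of a versioned evaluator catalogue and the resulting scorecard might look like the following. The judges here are toy rule-based functions standing in for pinned models and prompts; every name, version string, and field (`refund_amount`, `refund_limit`, `tool_used`, `allowed_tools`) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evaluator:
    dimension: str
    rubric_version: str
    judge_version: str
    dataset_version: str
    threshold: float
    judge: Callable[[dict], float]  # stand-in for a pinned model and prompt


def score_session(session: dict, catalogue: list[Evaluator]) -> dict[str, dict]:
    """Score each dimension independently and return a session-level scorecard."""
    scorecard: dict[str, dict] = {}
    for ev in catalogue:
        value = ev.judge(session)
        scorecard[ev.dimension] = {
            "score": value,
            "passed": value >= ev.threshold,
            # Lineage travels with the score so a later reader can reproduce it.
            "lineage": {
                "rubric": ev.rubric_version,
                "judge": ev.judge_version,
                "dataset": ev.dataset_version,
            },
        }
    return scorecard


def refund_policy_judge(session: dict) -> float:
    """Toy rule: 1.0 if the refund stays within the limit, else 0.0."""
    return 1.0 if session.get("refund_amount", 0) <= session["refund_limit"] else 0.0


def tool_use_judge(session: dict) -> float:
    """Toy rule: 1.0 if the tool used is allowed for the intent, else 0.0."""
    return 1.0 if session["tool_used"] in session["allowed_tools"] else 0.0


CATALOGUE = [
    Evaluator("refund_policy", "rubric-v3", "rules-v1", "cal-2026-02", 1.0, refund_policy_judge),
    Evaluator("tool_use", "rubric-v2", "rules-v1", "cal-2026-02", 1.0, tool_use_judge),
]

session = {"refund_amount": 120, "refund_limit": 100,
           "tool_used": "refund_api", "allowed_tools": ["refund_api", "billing_lookup"]}
scorecard = score_session(session, CATALOGUE)
print(scorecard["refund_policy"]["passed"], scorecard["tool_use"]["passed"])
```

The design choice worth noting is that a dimension never overwrites another: the session fails refund policy while passing tool use, and both facts survive in the scorecard.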

Step 4: define SLOs per objective and per failure mode

Three layers of SLO:

  • System SLOs. Latency P95, error rate, cost per session. Standard service metrics.
  • Objective SLOs. Faithfulness rolling average above 0.85; policy violations below 0.5 percent of sessions; goal completion above 0.7. These are the agent's job description, expressed as numbers.
  • Failure-mode SLOs. Specific failure clusters (tool authentication failures, format violations, off-policy refunds) tracked as their own series with their own thresholds.

The SLO is the contract the on-call rotation defends.
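The three layers can be expressed as data, which keeps the contract reviewable in version control. The sketch below uses the objective targets quoted above; the latency target and the failure-mode target are placeholder values, and the `Slo` class and `breached` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    layer: str               # "system", "objective", or "failure_mode"
    target: float
    higher_is_better: bool

    def is_met(self, observed: float) -> bool:
        # "Above" and "below" are read strictly, matching the wording of the SLOs.
        return observed > self.target if self.higher_is_better else observed < self.target


SLOS = [
    Slo("latency_p95_s", "system", 8.0, higher_is_better=False),            # placeholder target
    Slo("faithfulness_avg", "objective", 0.85, higher_is_better=True),
    Slo("policy_violation_rate", "objective", 0.005, higher_is_better=False),
    Slo("goal_completion_rate", "objective", 0.70, higher_is_better=True),
    Slo("tool_auth_failure_rate", "failure_mode", 0.01, higher_is_better=False),  # placeholder target
]


def breached(observed: dict[str, float], slos: list[Slo] = SLOS) -> list[str]:
    """Return the names of SLOs whose observed value misses the target."""
    return [s.name for s in slos if s.name in observed and not s.is_met(observed[s.name])]


observed = {
    "latency_p95_s": 6.2,
    "faithfulness_avg": 0.88,
    "policy_violation_rate": 0.008,   # 0.8 percent of sessions
    "goal_completion_rate": 0.74,
}
print(breached(observed))
```

In this example the system metrics are healthy and only the policy objective is breached, which is exactly the case a latency-and-error-rate dashboard would report as green.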

Step 5: alert on drift, not on absolute numbers

A fixed threshold on a noisy score produces noisy alerts. The signal that pays off is the rolling baseline:

  • Score drift. A named dimension's rolling 1-hour average moves more than N standard deviations from its 7-day baseline.
  • Cluster emergence. A failure mode whose count was zero last week is non-zero today.
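  • Sampling-aware rare events. A single off-policy response in the scored sample triggers an immediate alert, because the sample rate scales it to the population: at a 5 percent sample rate, one observed violation implies roughly 20 across all sessions.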
  • Calibration drift. Judge agreement with human labels drops below threshold (a meta-alert on the evaluator itself).

Each alert names the dimension, the rubric version, the trace IDs, and the dashboard link. The on-call workflow starts with "look at the flagged traces," not "go grepping."
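The score-drift rule above can be sketched as a z-score of the current hour against a baseline of hourly averages. This is one reasonable reading, not the only one: it assumes the baseline is the list of hourly means over the previous 7 days and that "standard deviations" refers to the spread of those hourly means. The function names and the sample numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def drift_z(baseline_hourly_means: list[float], current_hour_mean: float) -> float:
    """Z-score of the current hourly mean against the baseline of hourly means.

    Needs at least two baseline points, because stdev is undefined for one.
    """
    mu = mean(baseline_hourly_means)
    sigma = stdev(baseline_hourly_means)
    diff = current_hour_mean - mu
    if sigma == 0:
        # Flat baseline: any movement at all is infinitely far in sigma terms.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / sigma


def should_alert(baseline_hourly_means: list[float], current_hour_mean: float,
                 n_sigma: float = 2.0) -> bool:
    return abs(drift_z(baseline_hourly_means, current_hour_mean)) > n_sigma


def implied_population_count(observed_in_sample: int, sample_rate: float) -> float:
    """Scale a count seen in a uniform sample up to the full population."""
    return observed_in_sample / sample_rate


baseline = [0.90, 0.92, 0.88, 0.91, 0.89]  # shortened stand-in for 168 hourly means
print(should_alert(baseline, 0.85), should_alert(baseline, 0.905))
print(implied_population_count(1, 0.05))
```

The last helper is the arithmetic behind paging on a single sampled rare event: one violation seen at a 5 percent sample rate suggests about 20 in the full traffic.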

Step 6: route failures back into the evaluation set

Every flagged production trace is a candidate for the calibration dataset. The on-call workflow includes:

  • Triage the alert.
  • Label the flagged traces (correct, false positive, real failure, edge case).
  • Add real failures to the calibration set with the correct label.
  • Re-run the evaluator catalogue against the updated set.
  • Confirm that the pre-deployment evaluation now fails on any build that reproduces the failure.

This is the loop that makes monitoring compound. A failure caught once becomes a permanent gate against its recurrence.
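A minimal sketch of that loop, under simplifying assumptions: triaged traces carry a human label, real failures are promoted into the evaluation set, and a release gate replays each stored input through the candidate agent and asks the judge for a verdict. The names (`EvalCase`, `promote_failures`, `failing_cases`), the toy judge, and the two toy agents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    trace_id: str     # production trace the case came from
    user_input: str
    label: str        # human triage label, e.g. "real_failure" or "false_positive"


def promote_failures(triaged: list[EvalCase], eval_set: list[EvalCase]) -> list[EvalCase]:
    """Return a new evaluation set with real failures appended, skipping duplicates."""
    known = {c.trace_id for c in eval_set}
    added = [c for c in triaged if c.label == "real_failure" and c.trace_id not in known]
    return eval_set + added


def failing_cases(agent: Callable[[str], str], judge: Callable[[str, str], bool],
                  eval_set: list[EvalCase]) -> list[str]:
    """Replay each case through the candidate agent; return trace IDs the judge rejects."""
    return [c.trace_id for c in eval_set if not judge(c.user_input, agent(c.user_input))]


def judge(user_input: str, output: str) -> bool:
    """Toy policy judge: purchases older than 90 days must not be refunded."""
    return not ("older than 90 days" in user_input and "refund approved" in output.lower())


def old_agent(user_input: str) -> str:
    return "Refund approved."


def fixed_agent(user_input: str) -> str:
    if "older than 90 days" in user_input:
        return "Escalating to a human agent."
    return "Refund approved."


triaged = [
    EvalCase("t-1", "My purchase is older than 90 days, refund please", "real_failure"),
    EvalCase("t-2", "Refund my order from yesterday", "false_positive"),
]
eval_set = promote_failures(triaged, [])
print(failing_cases(old_agent, judge, eval_set), failing_cases(fixed_agent, judge, eval_set))
```

A non-empty result from `failing_cases` is what blocks the deploy; the old agent is blocked by the promoted case and the fixed agent passes.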

Step 7: tail-based sampling by score

Keep the budget bounded by sampling traces aggressively, with one exception: any trace where any dimension scored below threshold is kept at full fidelity. The retained low-score traces are deliberately biased toward interesting cases, so distribution metrics should be computed from the uniformly sampled remainder, which stays representative.
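The retention rule can be sketched as a single decision function: keep the trace if any scored dimension is below its threshold, otherwise keep it with a fixed probability. The sketch assumes scores are already available when the decision is made, and it hashes the trace ID so the decision is deterministic and repeatable; `keep_trace` and its parameters are hypothetical names.

```python
import hashlib


def keep_trace(trace_id: str, scores: dict[str, float],
               thresholds: dict[str, float], head_rate: float = 0.05) -> bool:
    """Tail-based retention: always keep low scorers, sample the rest uniformly."""
    below = any(scores[d] < thresholds[d] for d in scores if d in thresholds)
    if below:
        return True
    # Map the trace ID to a stable number in [0, 1) and compare with the sample rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < head_rate


thresholds = {"faithfulness": 0.85, "goal_completion": 0.70}
good = {"faithfulness": 0.93, "goal_completion": 0.90}
bad = {"faithfulness": 0.61, "goal_completion": 0.90}

print(keep_trace("trace-bad", bad, thresholds, head_rate=0.0))   # kept despite 0 percent sampling
kept = sum(keep_trace(f"trace-{i}", good, thresholds) for i in range(10_000))
print(kept / 10_000)  # close to the 5 percent head rate
```

Hashing the trace ID instead of calling a random number generator means every service that sees the same trace reaches the same keep-or-drop decision.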

Example

A team runs a multi-turn support agent for billing and technical issues.

Trace shape: one trace per session. Spans for each user turn, the retriever, the planner LLM, each tool call (billing lookup, refund API, knowledge base search), the responder LLM, and a guardrail.

Evaluator catalogue:

  • Faithfulness against retrieved knowledge-base documents.
  • Refund-policy adherence (rule-based, fast).
  • Tool-use correctness (LLM judge, rubric specifies allowed tools per intent).
  • Goal completion (LLM judge, rubric specifies what a resolved session looks like).
  • Tone (LLM judge, rubric specifies channel-appropriate register).

Each evaluator has a calibration dataset of 300 labelled sessions, refreshed monthly.

SLOs:

  • Latency P95 below 8 seconds for billing intents, 12 seconds for technical.
  • Faithfulness rolling 1-hour average above 0.85.
  • Refund-policy violations below 0.5 percent of refund-bearing sessions.
  • Goal completion above 0.70.
  • Judge agreement with human labels above 0.85 (meta-SLO).

Alerts:

  • Faithfulness drift more than 2 sigma over 6 hours.
  • Any refund-policy violation: immediate page.
  • Goal completion drift more than 2 sigma over 24 hours.
  • New failure cluster emerges (e.g., a new tool error type).
  • Judge agreement drops below 0.85.

Sampling: 5 percent head-sampling for all sessions, plus 100 percent retention for any session where any score is below threshold.

Feedback loop: every flagged session is reviewed within 24 hours, labelled, and either added to the calibration dataset or marked as a false positive. False positives drive evaluator refinement; real failures become evaluation cases that gate the next deploy.

The on-call rotation runs against named objectives and named failure modes. The dashboard shows score distributions, drift, and the live SLO summary. The trace viewer is the second screen, opened when an alert fires.

Limitations

  • Judge quality is the ceiling. A judge that disagrees with humans 25 percent of the time produces noisy SLOs and noisy alerts. Calibration is not optional; it is the baseline.
  • Sampling has gaps. Tail-based sampling on score keeps the things that scored badly; it cannot keep the things the judge missed. A miscalibrated judge under-keeps the failures it does not recognise.
  • Multi-turn state is hard. A trace can capture every span; reconstructing the agent's internal state across turns still requires deliberate schema. Without it, the trace tells you what happened and not why the model thought it was reasonable.
  • Alert fatigue is the most common failure. Threshold-based alerts on noisy scores produce paging that the team learns to ignore. Drift-based alerts on rolling baselines, with named dimensions and trace links, are the antidote.
  • Cost compounds. Inline evaluators add to user-facing latency and per-request cost. Async evaluators on a sampled fraction are usually the right shape; full inline scoring is reserved for surfaces where gating the response is worth the latency tax.
  • The feedback loop only works if it closes. Failures that surface and never get labelled are noise. The on-call workflow must include "label and route to calibration set" as an explicit step.
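The judge-quality limitation can be made concrete with a simple agreement check against human labels. The sketch below uses raw agreement rate, which is an assumption; a team might prefer a chance-corrected statistic instead. The function name and the 300-session example are hypothetical, chosen to mirror the 25 percent disagreement figure mentioned above.

```python
def agreement_rate(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of calibration sessions where the judge matches the human label."""
    if not human_labels or len(judge_verdicts) != len(human_labels):
        raise ValueError("verdicts and labels must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


# 300 labelled sessions; the judge disagrees with the human label on 75 of them.
human = [True] * 300
judge_out = [True] * 225 + [False] * 75

rate = agreement_rate(judge_out, human)
meta_slo = 0.85
print(rate, rate > meta_slo)
```

A judge at 0.75 agreement fails a 0.85 meta-SLO, so its scores should not drive paging until the rubric or the calibration set has been revised.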

FAQ

How do I set the right threshold for a new objective? You do not set the threshold first; you set the calibration dataset first. Label 200 to 300 sessions, run the judge against them, measure agreement with the labels, and pick the threshold that makes the false-positive and false-negative rates land where the product can absorb them. The threshold is downstream of the calibration data, not upstream.
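A small sketch of choosing the threshold from calibration data: compute false-positive and false-negative rates at each candidate threshold, then take the candidate with the lowest false-positive rate among those that stay within a false-negative budget. The selection rule, the function names, and the six-session data set are assumptions for illustration; a real calibration set would be the 200 to 300 labelled sessions mentioned above.

```python
from typing import Optional


def error_rates(scores: list[float], acceptable: list[bool],
                threshold: float) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at a threshold.

    The judge flags a session when its score is below the threshold.
    A false positive is an acceptable session that got flagged;
    a false negative is an unacceptable session that did not.
    """
    flagged = [s < threshold for s in scores]
    n_ok = sum(acceptable)
    n_bad = len(acceptable) - n_ok
    fp = sum(f and ok for f, ok in zip(flagged, acceptable))
    fn = sum((not f) and (not ok) for f, ok in zip(flagged, acceptable))
    return (fp / n_ok if n_ok else 0.0, fn / n_bad if n_bad else 0.0)


def pick_threshold(scores: list[float], acceptable: list[bool],
                   candidates: list[float], max_fn_rate: float) -> Optional[float]:
    """Lowest false-positive rate among candidates within the false-negative budget."""
    viable = []
    for t in candidates:
        fpr, fnr = error_rates(scores, acceptable, t)
        if fnr <= max_fn_rate:
            viable.append((fpr, t))
    return min(viable)[1] if viable else None


scores = [0.95, 0.90, 0.80, 0.75, 0.60, 0.40]
acceptable = [True, True, True, False, False, False]   # human labels
candidates = [0.50, 0.70, 0.78, 0.90]

print(pick_threshold(scores, acceptable, candidates, max_fn_rate=0.0))
```

Tightening or loosening `max_fn_rate` is where the product decision lives: it states how many real failures the team is willing to let through in exchange for fewer false alarms.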

Should I run evaluators inline or asynchronously? Async by default. Inline only when the score gates the response (refuse, rewrite, retry) and the latency tax is worth the control. Async evaluation keeps judge latency, judge cost, and judge outages off the user-facing path.

How do I avoid alert fatigue? Alert on drift against rolling baselines, not absolute thresholds. Name the dimension, link the flagged traces, route to a named owner. A page that lands in a channel with no owner is noise by construction.

What is the right sample rate? For distribution metrics, 1 to 10 percent head-sampling is usually enough. For audit and rare events, tail-based sampling on evaluator score is the right shape: keep every trace that scored below threshold, sample the rest.
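Whether a given head-sampling rate is enough depends on traffic volume, and a quick standard-error calculation makes the trade-off visible. The sketch below assumes a uniform sample, a pass/fail metric, and a hypothetical volume of 20,000 sessions per day; the function names are illustrative.

```python
import math


def scored_sessions(total_sessions: int, sample_rate: float) -> int:
    """Number of sessions that reach the evaluators at a given head-sampling rate."""
    return round(total_sessions * sample_rate)


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of an observed pass rate p estimated from n sampled sessions."""
    return math.sqrt(p * (1.0 - p) / n)


daily_sessions = 20_000  # assumed traffic volume
for rate in (0.01, 0.05, 0.10):
    n = scored_sessions(daily_sessions, rate)
    se = proportion_standard_error(0.85, n)
    print(f"sample rate {rate:.0%}: n={n}, standard error {se:.4f}")
```

At this assumed volume, moving from 1 percent to 5 percent sampling roughly halves the uncertainty on a daily pass rate, while the step from 5 to 10 percent buys much less.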

How do I monitor an agent that has no clear ground truth? Decompose the objective into dimensions where ground truth is achievable on a calibration set, even if production has none. Faithfulness against a retrieved document is checkable. Tone is checkable against examples. Goal completion is checkable on a labelled session set. The aggregate score may be opinion; the components can be calibrated.

What is the first thing to build on day one? One trace per session, an OpenTelemetry export, a single evaluator on the most important dimension, a small calibration set, and one SLO. Everything else is incremental on top of that.
