Updated: 2026-01-06 By: Ari Heljakka
Short answer
Agent monitoring is the boring infrastructure underneath every reliability claim, and most teams overbuild the dashboard and underbuild the loop. The defensible practice is a small set of metrics that drive weekly decisions (task completion, time-to-detect, time-to-resolve, recurrence rate, policy incident rate, alert precision), alerts tagged with severity and owner from the moment of detection, a triage model that clusters incidents instead of dispatching them individually, and a continuous flow from incidents into the evaluator suite that gates the next release. The win condition is a measurable drop in incident recurrence over weeks, not a richer dashboard.
Key facts
- Definition: A six-step reliability loop (capture, detect, prioritize, respond, validate, learn) operated on a small set of weekly-actionable metrics.
- When to use: Any agent serving real users where quality regressions have business cost.
- Limitations: Without ownership per metric and per alert class, the program collapses into noise; without an eval feedback path, fixes never compound.
- Example: A retrieval-grounded support agent runs nine alert classes, six metrics, and a weekly cadence; the error rate of its top failure class drops roughly twelvefold (from 8.4 to 0.7 percent) within eight weeks.
Key takeaways
- Six metrics are enough; pick them once and review them weekly.
- Severity at the source is what makes alerts triageable.
- Cluster incidents before assigning; one cluster of 50 is one ticket, not 50.
- Every confirmed incident becomes an evaluator candidate; the candidate gates the next release.
- The metric that proves the program is working is recurrence rate per class trending down.
Definition
Agent monitoring is the practice of continuously observing a production agent and converting the observations into operational decisions: alerting an owner, triaging an incident, gating a release, refreshing an evaluator. It sits between raw telemetry (logs, traces, tool events) and the evaluation infrastructure that scores releases.
Three distinctions matter:
- Monitoring is not observability. Observability is the underlying data layer of traces, spans, and attributes. Monitoring is what you do with that data to keep the system reliable.
- Monitoring is not evaluation. Evaluation grades a release or a slice of traffic against a rubric. Monitoring watches the same traffic continuously and routes anomalies into the evaluation pipeline.
- Monitoring is operational, not analytical. Its job is to drive next-step decisions (resolve, escalate, retire), not to produce reports.
When this matters
- Production agents with stateful behavior. Tool calls, retrieval, memory, multi-turn dialogues all create failure surfaces that request-level monitoring misses.
- Tiered customer impact. When a P0 affects enterprise tenants and a P2 affects long-tail traffic, severity must be enforced or the queue inverts.
- Multiple owners across surfaces. Prompt, retriever, tool, guardrail, judge: each has an owner; the alert must reach the right one.
- Regulatory or contractual SLAs. Time-to-detect and time-to-resolve become contractual; the monitoring program is the source of truth.
- Continuous model and prompt churn. Every release perturbs the failure distribution; monitoring catches the shift before users do.
How it works
The loop has six stages. Each stage owns one artifact the next stage consumes.
Stage 1: Capture session and span data
Every session emits structured events: inputs, outputs, intermediate model calls, tool invocations, retrieval spans, judge scores, user actions. The capture layer is plumbing; the requirement is that every event carries the metadata downstream stages need (session ID, user tier, model and prompt version, tool version, rubric version).
OpenTelemetry GenAI semantic conventions are the safe default; teams that roll their own schema generally re-implement the same fields six months later.
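The metadata requirement above can be sketched as a small record type plus a completeness check. This is a minimal illustration, not a schema: `SpanEvent`, `REQUIRED_METADATA`, and `missing_metadata` are hypothetical names, and the field names are not taken from the OpenTelemetry GenAI conventions.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical field names for illustration; they are NOT the attribute
# names defined by the OpenTelemetry GenAI semantic conventions.
REQUIRED_METADATA = (
    "session_id", "user_tier", "model_version",
    "prompt_version", "tool_version", "rubric_version",
)


@dataclass(frozen=True)
class SpanEvent:
    kind: str                       # e.g. "model_call", "tool_call", "retrieval"
    session_id: str
    user_tier: str
    model_version: str
    prompt_version: str
    tool_version: str
    rubric_version: str
    payload: dict[str, Any] = field(default_factory=dict)


def missing_metadata(event: SpanEvent) -> list[str]:
    """Return the required metadata fields that are empty on this event."""
    return [name for name in REQUIRED_METADATA if not getattr(event, name)]
```

Rejecting or flagging events with missing fields at capture time is one way to keep later stages from guessing at versions or tiers.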
Stage 2: Detect regressions from behavior, not just from errors
Detection runs on every session. The signals come from three sources:
- Hard errors. Exceptions, schema violations, tool failures, guardrail trips.
- Behavioral signals. Low judge score, near-guardrail-trip, user re-submission, escalation, premature termination, latency outliers, context-window pressure.
- Aggregated signals. Rolling drop in task completion on a workflow, rolling rise in tool-error rate by tool, drift in judge score distribution.
Each signal is tagged with severity at detection time. Three tiers are usually enough:
- P0: safety breaches, full workflow outages, policy violations with user impact.
- P1: sustained quality regression on a core workflow.
- P2: low-impact anomalies for the backlog.
Severity is the only field the queue can sort on without re-reading every trace; tagging at the source is what makes triage scale.
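A minimal sketch of tagging at the source, assuming a detector that emits a named signal. The signal names and the `tag_severity` helper are hypothetical; only the three tiers come from the text above.

```python
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # safety breaches, full workflow outages, policy violations with user impact
    P1 = 1  # sustained quality regression on a core workflow
    P2 = 2  # low-impact anomalies for the backlog


# Hypothetical signal names; a real program would use its own failure taxonomy.
P0_SIGNALS = {"safety_breach", "workflow_outage", "policy_violation_user_impact"}
P1_SIGNALS = {"sustained_quality_regression"}


def tag_severity(signal: str, core_workflow: bool = False) -> Severity:
    """Assign a tier at detection time, before the signal enters the queue."""
    if signal in P0_SIGNALS:
        return Severity.P0
    if signal in P1_SIGNALS and core_workflow:
        return Severity.P1
    return Severity.P2
```

Because `Severity` is an `IntEnum`, the queue can sort on it directly without opening any trace.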
Stage 3: Prioritize by cluster impact
Single incidents do not deserve tickets; clusters of incidents do. Useful cluster keys are failure class plus workflow segment, plus model and prompt version, plus user tier. Order the queue by severity multiplied by frequency multiplied by reach. P0 jumps the queue regardless of cluster size.
Every cluster has one assigned owner from the moment it enters the queue. Unowned clusters are the single most common source of program decay.
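The clustering and ordering rule can be sketched as follows. The text does not give a numeric scale for severity or a definition of reach, so the weights in `SEVERITY_WEIGHT` and the choice of "distinct users affected" as reach are assumptions made for illustration; `Incident` and `rank_clusters` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    failure_class: str
    workflow: str
    version: str      # model and prompt version, combined here for brevity
    user_tier: str
    severity: int     # 0 = P0, 1 = P1, 2 = P2
    user_id: str


# Assumed weights; the source only says severity x frequency x reach.
SEVERITY_WEIGHT = {0: 100, 1: 10, 2: 1}


def rank_clusters(incidents: list[Incident]) -> list[tuple[tuple, int]]:
    """Group incidents by cluster key and return (key, score), highest priority first."""
    clusters = defaultdict(list)
    for inc in incidents:
        key = (inc.failure_class, inc.workflow, inc.version, inc.user_tier)
        clusters[key].append(inc)

    scored = []
    for key, members in clusters.items():
        worst = min(m.severity for m in members)     # most severe tier in the cluster
        frequency = len(members)
        reach = len({m.user_id for m in members})    # assumption: distinct users
        score = SEVERITY_WEIGHT[worst] * frequency * reach
        # Sort key: P0 clusters first regardless of size, then by descending score.
        scored.append((worst != 0, -score, key, score))

    scored.sort()
    return [(key, score) for _, _, key, score in scored]
```

In this sketch, one P0 incident outranks a cluster of fifty P1 incidents even though its score is far lower, which is the queue-jumping behavior described above.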
Stage 4: Respond with runbooks
Each cluster type has a runbook. A useful runbook has four fields:
- The likely root-cause hypotheses, in order of frequency.
- The mitigation that buys time (rollback, feature flag, guardrail tightening).
- The fix surface (prompt, retriever, tool, guardrail).
- The regression artifact the fix must produce.
Without runbooks, every incident starts from zero; with them, response time is bounded by the slowest mitigation in the list.
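One way to keep runbooks honest is to store the four fields as structured data and check them for gaps. The sketch below assumes a hypothetical `Runbook` record and `runbook_gaps` helper; the allowed fix surfaces are the four named above.

```python
from dataclasses import dataclass

FIX_SURFACES = {"prompt", "retriever", "tool", "guardrail"}


@dataclass(frozen=True)
class Runbook:
    failure_class: str
    hypotheses: tuple[str, ...]   # root-cause hypotheses, most frequent first
    mitigation: str               # buys time: rollback, feature flag, guardrail tightening
    fix_surface: str              # one of FIX_SURFACES
    regression_artifact: str      # what the fix must produce


def runbook_gaps(rb: Runbook) -> list[str]:
    """Return the names of fields that are missing or invalid."""
    gaps = []
    if not rb.hypotheses:
        gaps.append("hypotheses")
    if not rb.mitigation:
        gaps.append("mitigation")
    if rb.fix_surface not in FIX_SURFACES:
        gaps.append("fix_surface")
    if not rb.regression_artifact:
        gaps.append("regression_artifact")
    return gaps
```

A check like this could run in the monthly audit so that incomplete runbooks surface before an incident needs them.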
Stage 5: Validate the fix with regression coverage
A fix is incomplete until two artifacts exist:
- A versioned regression case in the calibration dataset that exercises the failure pattern.
- An evaluator that scores the case and that runs on every release.
Both are versioned. The evaluator and the rubric live in the same artifact set as the prompts they protect.
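The two artifacts can be sketched as a versioned regression case and an evaluator that runs over every case for each release candidate. Everything named here is hypothetical: `RegressionCase`, `TICKET_LOOKUP_SCHEMA`, `argument_validity`, and `failing_cases` are illustrative, and `candidate` stands in for whatever produces tool-call arguments in a real agent.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    dataset_version: str
    user_input: str        # input that reproduces the failure pattern


# Hypothetical argument schema for a ticket-lookup tool.
TICKET_LOOKUP_SCHEMA = {"ticket_id": int, "include_history": bool}


def argument_validity(args: dict[str, Any], schema: dict[str, type]) -> float:
    """Score 1.0 if every schema field is present with the exact type, else 0.0."""
    for name, expected in schema.items():
        # Exact type match, so a bool is not accepted where an int is expected.
        if name not in args or type(args[name]) is not expected:
            return 0.0
    return 1.0


def failing_cases(
    cases: list[RegressionCase],
    candidate: Callable[[str], dict[str, Any]],
    schema: dict[str, type] = TICKET_LOOKUP_SCHEMA,
) -> list[str]:
    """Run every regression case through the candidate; return IDs that fail."""
    return [
        case.case_id
        for case in cases
        if argument_validity(candidate(case.user_input), schema) < 1.0
    ]
```

A release gate built on this would block any candidate for which `failing_cases` returns a non-empty list.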
Stage 6: Learn by promoting incidents into the eval suite
The closed loop is the part most teams skip. Every confirmed incident becomes a candidate for the evaluator suite that gates the next release; every promoted evaluator runs continuously, and its score is itself a monitored signal. Two metrics close the loop:
- Recurrence rate per failure class. Trending down means the loop is compounding.
- Pre-release catch rate. Non-zero means the eval suite is catching what monitoring previously caught only after the fact.
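Both loop-closing metrics reduce to simple ratios. The sketch below uses the recurrence definition from this article (fixed clusters that re-emerge within N weeks); the formula for pre-release catch rate is an assumption, since the text does not define it precisely. Function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def recurrence_rate(
    fixes: list[tuple[date, Optional[date]]], window_weeks: int
) -> float:
    """Fraction of fixed clusters in one failure class that re-emerged in the window.

    Each entry is (fixed_on, re_emerged_on), with None if the cluster never came back.
    """
    if not fixes:
        return 0.0
    window = timedelta(weeks=window_weeks)
    recurred = sum(
        1
        for fixed_on, again in fixes
        if again is not None and again - fixed_on <= window
    )
    return recurred / len(fixes)


def pre_release_catch_rate(blocked_pre_release: int, escaped_to_production: int) -> float:
    """Assumed definition: share of known-class regressions stopped before release."""
    total = blocked_pre_release + escaped_to_production
    return blocked_pre_release / total if total else 0.0
```

Tracking `recurrence_rate` per failure class week over week gives the trend line that shows whether fixes are compounding.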
The six metrics that drive decisions
Pick once. Review every week.
| Metric | Definition | Decision it drives |
|---|---|---|
| Task completion rate, by workflow | Fraction of sessions reaching the intended end state | Whether a workflow is healthy enough to remain on auto-deploy |
| Mean time to detect (MTTD) | Time from incident-causing event to alert | Whether detectors need new signals |
| Mean time to resolve (MTTR) | Time from alert to deployed fix | Whether runbooks need updating |
| Recurrence rate per failure class | Fraction of fixed clusters that re-emerge within N weeks | Whether the fix generalized |
| Policy incident rate | Confirmed policy violations per 1000 sessions | Whether guardrails need tightening or expanding |
| Alert precision | Fraction of alerts that turned out to be real incidents | Whether thresholds need re-tuning |
Six metrics fit on one page, drive concrete decisions, and resist the dashboard sprawl that buries the real signal.
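Four of the table's definitions translate directly into arithmetic over incident and alert records. A minimal sketch, assuming a hypothetical `IncidentRecord` with three timestamps; the function names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class IncidentRecord:
    occurred_at: datetime   # incident-causing event
    alerted_at: datetime    # alert raised
    fixed_at: datetime      # fix deployed


def mttd_minutes(records: list[IncidentRecord]) -> float:
    """Mean time from incident-causing event to alert. Requires at least one record."""
    return mean((r.alerted_at - r.occurred_at).total_seconds() for r in records) / 60


def mttr_minutes(records: list[IncidentRecord]) -> float:
    """Mean time from alert to deployed fix. Requires at least one record."""
    return mean((r.fixed_at - r.alerted_at).total_seconds() for r in records) / 60


def alert_precision(alerts_fired: int, confirmed_incidents: int) -> float:
    """Fraction of alerts that turned out to be real incidents."""
    return confirmed_incidents / alerts_fired if alerts_fired else 0.0


def policy_incident_rate(confirmed_violations: int, sessions: int) -> float:
    """Confirmed policy violations per 1000 sessions."""
    return 1000 * confirmed_violations / sessions if sessions else 0.0
```

Task completion rate and recurrence rate follow the same pattern: a count of sessions or clusters that meet the condition, divided by the total.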
Alerting strategy
Three rules keep the alert program useful.
- Every alert carries severity, ownership, and first-action. No alert ships without those three fields.
- No paging on aggregate trends. Trends go to the weekly review; pages go to threshold-crossings with user impact.
- Alert precision is a metric. Track it, and retire alert classes whose precision falls below a threshold (50 percent is a reasonable floor).
Tying alert classes to failure classes (one detector per class, not per signal) is what stops the cardinality from exploding.
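The first and third rules are easy to enforce mechanically. The sketch below assumes alerts are plain dictionaries and that per-class counts of fired and confirmed alerts are available; `can_ship`, `classes_to_retire`, and the `min_alerts` sample-size guard are hypothetical additions, not part of the rules above.

```python
REQUIRED_ALERT_FIELDS = ("severity", "owner", "first_action")
PRECISION_FLOOR = 0.5   # the 50 percent floor suggested above


def can_ship(alert: dict) -> bool:
    """An alert ships only if severity, owner, and first action are all set."""
    return all(alert.get(name) not in (None, "") for name in REQUIRED_ALERT_FIELDS)


def classes_to_retire(
    stats: dict[str, tuple[int, int]],
    floor: float = PRECISION_FLOOR,
    min_alerts: int = 20,   # assumed guard: do not judge a class on a tiny sample
) -> list[str]:
    """Return alert classes whose precision is below the floor.

    stats maps alert class -> (alerts fired, alerts confirmed as real incidents).
    """
    retire = []
    for name, (fired, confirmed) in stats.items():
        if fired >= min_alerts and confirmed / fired < floor:
            retire.append(name)
    return sorted(retire)
```

The output of `classes_to_retire` is a natural agenda item for the weekly review rather than something to act on automatically.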
Triage model for scale
Triage is a clustering operation, not a dispatching operation.
- Cluster. Group recent incidents by failure class plus segment plus version.
- Rank. Order by severity x frequency x reach.
- Assign. Owner from the surface mapping.
- Track. Status, blockers, ETA, and the regression artifact in flight.
The triage queue is the operational object of the program. If the queue is not the source of truth, the dashboard becomes the source of truth, and that is when the program starts to drift.
Weekly cadence
- Daily standup, 15 to 20 minutes. P0 and P1 status; new clusters; validation of mitigations from the prior day.
- Weekly reliability review, 45 to 60 minutes. Tune thresholds, retire low-precision alerts, promote incidents into the eval suite, review the six metrics.
- Monthly audit, 90 minutes. Runbook health, ownership mapping, evaluator coverage, dataset health.
The cadence is light because the metrics are few. The metrics are few because the cadence is light. Both fall apart together if either is overgrown.
Example
A team operates a customer-support agent with retrieval grounding and three tools (knowledge-base search, ticket lookup, escalation). Volume: 22,000 sessions per week.
Metrics. Six metrics on one page; reviewed every Tuesday.
Detectors. Nine alert classes, each tied to a failure class in the taxonomy. P0 fires on policy-breach detections and on workflow-completion drops below 90 percent over a 30-minute window. P1 fires on cluster-level regressions in judge scores. P2 collects anomalies for the backlog.
Top cluster, week 1. TOOL_ARGUMENT_ERROR on ticket lookup, 8.4 percent of sessions touching the tool. Owner: tool-integration engineer. Runbook step 1: roll back the previous day's argument-schema change. Step 2: add explicit type-coercion in the agent's tool-call template. Step 3: add a regression test exercising the prior failure case; add a judge that scores tool-call argument schema validity on every release.
Week 8 outcome. TOOL_ARGUMENT_ERROR is down from 8.4 percent to 0.7 percent of sessions touching the tool. The argument-validity judge has fired four times on pre-release builds and blocked all four. The team's top cluster is now CITATION_MISMATCH at 4.1 percent; they rotate ownership and start the next loop.
Audit trail. Every alert, cluster, owner, mitigation, fix, regression test, and evaluator is recorded against versions. A risk officer asking "what was our tool failure rate on enterprise tenants in week 4?" receives a specific answer with the supporting evidence.
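The example's P0 rule on workflow completion (below 90 percent over a 30-minute window) can be sketched as a sliding-window detector. This is an illustration under assumptions: sessions are observed in time order, and the `min_sessions` guard against firing on a handful of sessions is an addition not stated in the example. `CompletionDropDetector` is a hypothetical name.

```python
from collections import deque
from datetime import datetime, timedelta


class CompletionDropDetector:
    def __init__(
        self,
        threshold: float = 0.90,
        window: timedelta = timedelta(minutes=30),
        min_sessions: int = 50,   # assumed guard against tiny samples
    ) -> None:
        self.threshold = threshold
        self.window = window
        self.min_sessions = min_sessions
        self._sessions: deque[tuple[datetime, bool]] = deque()

    def observe(self, ended_at: datetime, completed: bool) -> bool:
        """Record one finished session; return True if the P0 alert should fire.

        Assumes sessions are reported in non-decreasing time order.
        """
        self._sessions.append((ended_at, completed))
        cutoff = ended_at - self.window
        # Drop sessions that have aged out of the window.
        while self._sessions and self._sessions[0][0] < cutoff:
            self._sessions.popleft()

        total = len(self._sessions)
        if total < self.min_sessions:
            return False
        rate = sum(1 for _, done in self._sessions if done) / total
        return rate < self.threshold
```

Because the rule is a threshold crossing on a short window, it fits the paging rule in the alerting strategy; slower drifts in completion would go to the weekly review instead.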
Limitations
- The six metrics are starting points. Adapt them to the agent's surface. The discipline is "small, weekly-actionable set," not the specific six.
- Alert precision is hard to keep above 70 percent. Below 50 percent, fatigue dominates. Be ruthless about retiring detectors.
- Cluster keys leak. New model versions can split or merge clusters. Re-cluster on every release.
- Ownership without authority is theater. Owners need the ability to ship the fix; otherwise the queue grows.
- Eval feedback is the part teams skip. Without it, monitoring is a video feed; with it, it is a compiler that turns incidents into release gates.
Evidence and sources
- OpenTelemetry semantic conventions for GenAI. opentelemetry.io/docs/specs/semconv/gen-ai
- Google SRE workbook: alerting on SLOs. sre.google/workbook
- Anthropic: Building effective agents. anthropic.com/research/building-effective-agents
FAQ
Why exactly six metrics? Because that is the number a team can carry between reviews without dropping any. Pick fewer if you can; do not pick more.
Why not page on judge-score drift? Drift is a trend, not an event. Pages go to events; drift goes to weekly review.
Who owns alert precision? The on-call engineer in the rotation owns the alert class they ship; the program manager owns the aggregate. Both numbers are reviewed weekly.
How do I know my evaluator suite has good coverage? Coverage is good when an evaluator exists for every failure class with one or more confirmed incidents in the past quarter. Coverage is binary per class; the metric is the count of uncovered classes.
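The coverage check described in this answer is a set difference. A minimal sketch, assuming per-class incident counts for the quarter and a set of classes that already have an evaluator; `uncovered_classes` is a hypothetical helper.

```python
def uncovered_classes(
    confirmed_incidents: dict[str, int], evaluator_classes: set[str]
) -> list[str]:
    """Failure classes with at least one confirmed incident and no evaluator."""
    return sorted(
        name
        for name, count in confirmed_incidents.items()
        if count >= 1 and name not in evaluator_classes
    )
```

The length of the returned list is the coverage metric; the target is zero.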
Can monitoring replace pre-deploy evaluation? No. Monitoring catches what is already in production. Evaluation catches what would be in production. The two are complementary; one cannot substitute for the other.
What if my agent has no clear "workflow"? Workflows are a useful framing even for free-form chat; they are just the high-level user goals you can name. If you cannot name any, monitoring degrades to error-rate watching, which is half the program at best.
Related reading
- How do you detect, triage, and eliminate agent failures?
- How to Build Eval-Driven AI Observability for Agents
- Agent Observability and the Complexity of Agentic Systems
- AI Agent Monitoring for Heads of AI: KPIs, Drift as Operational Signal, and Issue-Centric Quality
- How do you debug AI agents in production?