How do you detect, triage, and eliminate agent failures?

Updated: 2026-01-03 By: Ari Heljakka

Short answer

Agent failures in production are not random; they cluster into a small number of recurring patterns and a long tail of one-offs. A working failure-tracking program detects incidents from severity-tagged signals (not raw logs), classifies them against a standard taxonomy, triages by user impact rather than by recency, fixes with regression coverage, and converts every recurring failure into a versioned evaluator that gates the next release. The loop is what turns a backlog of incidents into a measurable drop in failure rate over time.

Key facts

  • Definition: A structured operating model for detecting, classifying, triaging, and eliminating recurring failures in production AI agents.
  • When to use: Any agent in production with non-trivial volume, more than one developer, and a quality bar that matters to users.
  • Limitations: Without a versioned taxonomy and ownership, the program degenerates into "look at the latest alert"; without regression coverage, fixes do not stick.
  • Example: A contract-review agent with eight failure classes drops recurrence on its top class from 14 percent of sessions to 1.2 percent over six weeks using this loop.

Key takeaways

  • Failures need a versioned taxonomy before they need a dashboard.
  • Severity belongs to the incident, not to the alert; tag at detection time.
  • Every recurring failure becomes an evaluator; the evaluator gates future releases.
  • Ownership per failure class is non-negotiable; unowned classes accumulate silently.
  • The win condition is a downward trend in recurrence rate per class, not zero incidents.

Definition

A failure-tracking program is the set of practices that takes raw production signals and converts them into a steady reduction in agent failure rate over time. It has six moving parts: detection, classification, clustering, triage, fix with regression coverage, and validation. Each part is owned, versioned, and measured. The deliverable is not a dashboard; it is a downward trend on a small number of metrics that the team agrees represent reliability.

The program sits between observability (which captures signals) and evaluation (which gates releases). Observability tells you what happened; evaluation tells you whether a release meets the bar; failure tracking is the loop that converts the gap between them into operational improvements.

When this matters

  • Multi-step agents with tool use. Tool-call failures, argument errors, and context drift cluster into recurring patterns that a request-level metric misses.
  • Production volume above ad-hoc review. Once you cannot read every trace, you need a classifier and a triage queue.
  • Multiple owners. When several engineers ship to the same agent, taxonomy and ownership prevent finger-pointing and double-fixes.
  • Regulated workloads. Audit and compliance need a record of what failed, what was changed, and what the change did.
  • Long-running agents. Failure patterns shift as upstream models update, as users discover new use cases, and as tools change. A static dashboard becomes stale; a loop stays current.

How it works

The loop has six stages. Each stage produces an artifact the next stage consumes.

Stage 1: Detect with severity at the source

Detection runs on every session, not on a sample. Three classes of signal feed the queue:

  • Hard failures. Exceptions, tool errors, schema violations, policy guardrails tripped.
  • Behavioral signals. User re-submission, low judge score, near-guardrail-trip, premature termination, context-window pressure, latency outliers.
  • External feedback. Thumbs-down, escalation to a human, support ticket linked to a trace.

Each signal carries a severity tag from the moment it is captured. P0 covers safety breaches, outright outages, and policy violations with user impact. P1 covers sustained quality regressions on a core workflow. P2 covers low-impact anomalies that go to a backlog. Severity at the source matters because it is the only field the triage queue can sort on without re-reading every trace.
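A minimal sketch of severity tagging at capture time, in Python. The detector names, the detector-to-severity mapping, and the P2 default for unknown detectors are assumptions for illustration, not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Severity(IntEnum):
    """Lower value means more urgent, so a plain sort puts P0 first."""
    P0 = 0  # safety breach, outright outage, policy violation with user impact
    P1 = 1  # sustained quality regression on a core workflow
    P2 = 2  # low-impact anomaly that goes to the backlog


@dataclass(frozen=True)
class Signal:
    session_id: str
    kind: str           # "hard_failure", "behavioral", or "external_feedback"
    detector: str       # hypothetical detector name
    severity: Severity  # assigned when the signal is captured, not later
    captured_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Hypothetical mapping; each team would define and version its own.
DETECTOR_SEVERITY = {
    "policy_guardrail_tripped": Severity.P0,
    "tool_error": Severity.P1,
    "low_judge_score": Severity.P1,
    "latency_outlier": Severity.P2,
}


def emit(session_id: str, kind: str, detector: str) -> Signal:
    """Build a signal with its severity attached at the source."""
    # Assumption: unknown detectors default to P2 rather than being dropped.
    severity = DETECTOR_SEVERITY.get(detector, Severity.P2)
    return Signal(session_id, kind, detector, severity)


queue = [
    emit("s-1", "behavioral", "latency_outlier"),
    emit("s-2", "hard_failure", "policy_guardrail_tripped"),
    emit("s-3", "hard_failure", "tool_error"),
]
# The queue sorts on the severity field alone; no trace is re-read.
queue.sort(key=lambda s: s.severity)
```

Because the tag travels with the signal, the queue can be ordered without opening any trace, which is the property the stage depends on.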

Stage 2: Classify against a standard taxonomy

A working taxonomy has between six and twelve classes. Six that hold up across most agents:

  • CONTEXT_DRIFT. The agent forgot, misinterpreted, or selectively recalled context.
  • TOOL_CALL_FAILURE. The tool returned an error, timed out, or was unavailable.
  • TOOL_ARGUMENT_ERROR. The agent invoked a tool with wrong or missing arguments.
  • GROUNDING_FAILURE. The output claims something the retrieved context does not support (hallucination).
  • POLICY_BREACH. The output violated a stated policy (PII leak, restricted topic, unsafe action).
  • RELEASE_REGRESSION. A behavior that worked on the prior release no longer works.

Classification happens at triage time, not at detection time. Auto-classifiers (small LLM judges, regex patterns over tool errors) propose a class; a human confirms during the daily review for P0 and P1.
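The regex half of an auto-classifier could look roughly like the sketch below. The patterns and the Proposal record are hypothetical; the point is only that the classifier proposes a class and leaves confirmation to a human.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns over tool error text; real ones depend on the tools.
PATTERNS = [
    ("TOOL_CALL_FAILURE",
     re.compile(r"timed? ?out|unavailable|\b5\d\d\b", re.IGNORECASE)),
    ("TOOL_ARGUMENT_ERROR",
     re.compile(r"missing (required )?argument|invalid (argument|parameter)",
                re.IGNORECASE)),
]


def propose_class(tool_error: str) -> Optional[str]:
    """Return the first matching failure class, or None if nothing matches."""
    for name, pattern in PATTERNS:
        if pattern.search(tool_error):
            return name
    return None


@dataclass
class Proposal:
    trace_id: str
    proposed: Optional[str]
    confirmed: bool = False  # set by a human in the daily review for P0 and P1


proposal = Proposal("t-17", propose_class("HTTP 503 Service Unavailable"))
```

Traces that match no pattern return None and would fall through to an LLM judge or to manual labeling.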

Stage 3: Cluster into patterns

A cluster is a group of recent traces in a single class that share a signature. Useful cluster keys:

  • Failure class plus workflow segment ("CONTEXT_DRIFT in summary stage")
  • Class plus model version, prompt version, or tool version
  • Class plus user-impact profile (paying account, enterprise tier, anonymous)

Clusters are the unit of work; individual traces are evidence. A cluster of 50 instances of the same TOOL_ARGUMENT_ERROR is one ticket, not 50.
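As a rough sketch, clustering can be as simple as grouping classified incidents by a composite key. The Incident fields and the sample segment names here are illustrative assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class Incident(NamedTuple):
    trace_id: str
    failure_class: str
    workflow_segment: str  # hypothetical field, e.g. "summary"
    model_version: str     # hypothetical field


def cluster(incidents, key_fields=("failure_class", "workflow_segment")):
    """Group trace IDs by the chosen cluster key."""
    clusters = defaultdict(list)
    for incident in incidents:
        key = tuple(getattr(incident, name) for name in key_fields)
        clusters[key].append(incident.trace_id)
    return dict(clusters)


# 50 instances of the same failure plus one unrelated incident.
incidents = [
    Incident(f"t-{i}", "TOOL_ARGUMENT_ERROR", "extraction", "m-1")
    for i in range(50)
]
incidents.append(Incident("t-50", "CONTEXT_DRIFT", "summary", "m-1"))

clusters = cluster(incidents)
# Two clusters, so two tickets; the 50 traces are evidence for one of them.
```

Swapping key_fields to ("failure_class", "model_version") gives the version-based view without changing the grouping logic.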

Stage 4: Triage by impact

The triage queue is ordered by impact, not by recency. Impact is severity multiplied by frequency multiplied by reach. P0 incidents jump the queue regardless of cluster size; P1 clusters compete on volume; P2 clusters wait until a weekly review.
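One way to turn that ordering rule into code is sketched below. The numeric severity weights, the definition of frequency as a share of sessions, and the reach multiplier are all assumptions; the source only states that impact is the product of the three.

```python
from dataclasses import dataclass

# Hypothetical weights; the source does not prescribe numeric values.
SEVERITY_WEIGHT = {"P0": 100.0, "P1": 10.0, "P2": 1.0}


@dataclass
class Cluster:
    name: str
    severity: str     # "P0", "P1", or "P2"
    frequency: float  # assumed: share of sessions affected, 0 to 1
    reach: float      # assumed: multiplier for the user-impact profile

    @property
    def impact(self) -> float:
        return SEVERITY_WEIGHT[self.severity] * self.frequency * self.reach


def daily_queue(clusters):
    """P0 first regardless of size, then P1 by impact; P2 is held back."""
    urgent = [c for c in clusters if c.severity in ("P0", "P1")]
    return sorted(urgent, key=lambda c: (c.severity != "P0", -c.impact))


def weekly_backlog(clusters):
    """P2 clusters, ordered by impact, for the weekly review."""
    backlog = [c for c in clusters if c.severity == "P2"]
    return sorted(backlog, key=lambda c: -c.impact)


clusters = [
    Cluster("citation-mismatch", "P1", 0.14, 1.0),
    Cluster("pii-leak", "P0", 0.001, 1.0),
    Cluster("redline-drift", "P1", 0.05, 3.0),
    Cluster("slow-export", "P2", 0.5, 1.0),
]
ordered = [c.name for c in daily_queue(clusters)]
```

Note that the rare P0 cluster sorts first even though its computed impact is the lowest, and that recency never appears in the sort key.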

Every cluster has an assigned owner from the moment it enters the queue. Unowned clusters are the most common single source of program decay; they accumulate until the queue becomes ignorable.

Stage 5: Fix with regression coverage

A fix has three required parts:

  • The change. A prompt edit, a tool patch, a guardrail addition, a context-window adjustment.
  • A regression test. At least one example from the cluster is added to a versioned dataset that the release gate runs against.
  • A judge or check. An evaluator that detects the failure pattern at gate time, so a future regression blocks the merge.

The third part is what makes the program compound. Without it, fixes stop the bleeding for the current cluster and leave nothing behind. With it, every fix permanently adds protection.
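The regression set and the evaluator come together at the release gate. The sketch below assumes a hypothetical run_agent callable for the candidate build, a hypothetical dataset layout, and a simple pass-rate threshold; real gates will differ in how they score.

```python
from typing import Callable, Dict, List, Tuple

DATASET_VERSION = "tool-args-v1"  # hypothetical version label


def release_gate(
    dataset: List[Dict],
    run_agent: Callable[[str], Dict],
    evaluator: Callable[[Dict, Dict], bool],
    threshold: float,
) -> Tuple[bool, float]:
    """Run the candidate build on every case and compare the pass rate."""
    if not dataset:
        raise ValueError("regression set is empty; the gate would be vacuous")
    passed = sum(
        1 for case in dataset if evaluator(case, run_agent(case["input"]))
    )
    score = passed / len(dataset)
    return score >= threshold, score


def args_complete(case: Dict, output: Dict) -> bool:
    """Evaluator for TOOL_ARGUMENT_ERROR: all required arguments present."""
    return case["required_args"].issubset(output["tool_args"])


# Cases taken from a fixed cluster; contents are illustrative only.
dataset = [
    {"input": "fetch clause 4.2", "required_args": {"document_id", "clause_id"}},
    {"input": "fetch clause 9.1", "required_args": {"document_id", "clause_id"}},
]


def fixed_build(prompt: str) -> Dict:
    return {"tool_args": {"document_id": "d-1", "clause_id": "c-1"}}


def regressed_build(prompt: str) -> Dict:
    return {"tool_args": {"document_id": "d-1"}}  # clause_id dropped again


ok, score = release_gate(dataset, fixed_build, args_complete, threshold=1.0)
blocked = release_gate(dataset, regressed_build, args_complete, threshold=1.0)
```

The regressed build fails the same evaluator that was written for the original fix, so the merge is blocked before the failure reaches production.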

Stage 6: Validate and learn

After deployment, the same evaluator that gated the fix runs continuously in production. Two metrics close the loop:

  • Recurrence rate per class. How often does the failure show up again in the two weeks after the fix?
  • Pre-release catch rate. How many regressions did the gate catch before they shipped?

A class with a falling recurrence rate and a rising catch rate is being eliminated. A class with a flat recurrence rate has a fix that did not generalize; reopen it.
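Both metrics reduce to simple ratios. The definitions below are one plausible reading, stated as assumptions: recurrence rate is the share of sessions in a window with at least one incident of the class, and catch rate is the share of known regressions that the gate stopped before release.

```python
from typing import Dict, Set


def recurrence_rate(session_classes: Dict[str, Set[str]], failure_class: str) -> float:
    """Share of sessions in the window that hit the given failure class."""
    if not session_classes:
        return 0.0
    hits = sum(
        1 for classes in session_classes.values() if failure_class in classes
    )
    return hits / len(session_classes)


def catch_rate(caught_pre_release: int, escaped_to_production: int) -> float:
    """Share of known regressions blocked at the gate."""
    total = caught_pre_release + escaped_to_production
    # Assumption: with no regressions observed at all, report 0.0.
    return caught_pre_release / total if total else 0.0


# Illustrative window: 1,000 sessions, 140 of them with CITATION_MISMATCH.
window = {
    f"s-{i}": ({"CITATION_MISMATCH"} if i < 140 else set())
    for i in range(1000)
}
rate = recurrence_rate(window, "CITATION_MISMATCH")
caught = catch_rate(caught_pre_release=2, escaped_to_production=0)
```

Computing the rate over the same window length each time keeps the week-over-week trend comparable.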

Example

A six-engineer team operates a contract-review agent. The agent extracts terms, flags risk clauses, and drafts negotiation notes from uploaded PDFs. Volume is roughly 9,000 sessions per week.

Detection. Every session emits a structured event: tool calls, intermediate outputs, judge scores (faithfulness, completeness), user actions (re-runs, edits, exports). Six detectors fire P0 to P2 signals.

Taxonomy. Eight classes, including the six standard ones plus REDLINE_DRIFT (the model proposed a redline contradicted by earlier turns) and CITATION_MISMATCH (a cited clause does not match the retrieved span).

Triage cadence. A daily 20-minute standup reviews P0 and P1 clusters; a weekly 60-minute review covers the P2 backlog and tunes alert thresholds.

Top cluster, week 1. CITATION_MISMATCH, 14 percent of sessions, owner assigned. Root cause: the retriever returns the top-5 spans, and the model cites any of them even when only one matches. Fix: tighten the retriever rerank threshold, add a judge that scores each citation against its span, and gate releases on at least 0.92 citation accuracy on a 120-example regression set.

Week 6 outcome. CITATION_MISMATCH recurrence rate down to 1.2 percent of sessions. The judge has fired twice on pre-release builds, both blocked. The team rotates to the next top cluster (REDLINE_DRIFT, currently at 6.8 percent).

Audit trail. Every incident, classification, owner, fix, regression test, and gate score is stored against a rubric version. A regulator asking "what was the failure rate on contract Y in week 4?" receives a specific answer with the supporting traces.

Limitations

  • Auto-classifiers drift. The judge that proposes failure classes degrades as failure patterns evolve. Recalibrate against human-labeled samples every few weeks.
  • Severity tagging is a discipline, not a tool. If detectors emit everything at P1, the queue is useless. Owners must push back on misuse.
  • Regression sets grow. A protected failure class with a 50-example regression set is cheap; 50 classes at 50 examples each is a meaningful test budget. Prune by deduplicating cases that the same evaluator catches.
  • Cluster keys leak. A new model version that subtly changes failure shape can split one cluster into many. Re-cluster every release.
  • Fix-and-forget is the default failure mode. Without the regression and the evaluator, the loop is open and recurrence is invisible until users complain again.
  • Ownership without authority is theater. Owners need the ability to ship the fix or the program stops.

FAQ

How many failure classes should the taxonomy have? Between six and twelve. With fewer, the classes lose discriminative power; with more, the triage cost dominates the fix cost.

Who owns a cluster? Whoever owns the surface area where the failure originates: prompt, retriever, tool, guardrail, judge. If ownership is unclear, the program manager assigns; the assignment is recorded.

What if a failure spans classes? Pick the dominant class and tag the others as secondary. Multi-class failures often resolve to a single root cause; one owner is faster than three.

Should detectors block releases? No. Detectors run on production and write to the queue. Release gates are the evaluators that came out of past fixes. Blocking releases on raw detectors causes deploy fatigue.

How do I know the program is working? The recurrence rate per class trends down over weeks, the pre-release catch rate is non-zero, and the time from cluster identification to fix shrinks. A flat or rising recurrence rate means the loop is leaking somewhere.

Can I run this on a single-engineer team? Yes, with a smaller taxonomy and a lighter cadence. The pieces (taxonomy, severity, regression set, evaluator) matter more than the headcount.
