Updated: 2026-04-19 By: Ari Heljakka
Short answer
AI agents that pass dev-time tests still fail on multi-step tasks in production because the failures are emergent, not unit-level. Four patterns dominate: reasoning drift across long sessions, silent tool-call failures, context-window saturation, and goal misalignment between literal and intended outcomes. Detection requires session-level observability paired with versioned objectives and managed evaluators; a trace store alone shows you what happened but not whether it met the bar.
Key facts
- Definition: Production agent failure is the gap between an agent passing isolated test cases and the same agent missing user intent on multi-step, multi-turn real-world tasks.
- When to use: Any team running an agent in front of real users on tasks longer than two or three steps.
- Limitations: Failures compound multiplicatively; even 99% per-step reliability yields only about 82% end-to-end success over twenty steps. Detection requires deliberate instrumentation, not default logging.
- Example: A 20-step workflow at 95% per-step reliability succeeds end to end only 36% of the time; the failures cluster in patterns that engineer-written tests rarely cover.
Key takeaways
- Step-level reliability does not compose; multi-step success rates collapse fast.
- The four dominant failure patterns (reasoning drift, tool failures, context saturation, goal misalignment) are emergent and require session-level observability.
- Dev-time tests fail to surface these patterns because synthetic inputs do not match the production tail.
- Detection needs versioned objectives, not just spans; a trace store records what happened, a scorecard records whether it met the bar.
- Prioritize annotation queues, fold labeled failures back into managed evaluators, and gate releases on the regression set those failures create.
Definition
A production agent failure is any case where the agent traverses a multi-step task without raising an exception and yet does not meet the user's intent. The defining property is that the failure is not visible at the level of a single tool call or model response. It is only visible when the session is viewed as a unit, against an objective that says what success looks like.
The unit of observation is the session. The unit of evaluation is the scored sample against a versioned objective. The unit of remediation is the labeled failure case, folded into the regression set that the next release must pass.
When this matters
Multi-step agents are now common; production failures in them are now expensive. The case for treating agent failure detection as first-class infrastructure gets stronger when:
- Tasks routinely span five or more turns or tool calls.
- The cost of an unnoticed failure is user-visible (refunds, escalations, broken workflows).
- The agent surface is broad enough that engineer-curated test cases cannot cover the tail.
- Deployment cadence is fast enough that silent regressions land before they are noticed.
If a single-prompt completion is the product, dev-time tests are usually enough. Once the product is a multi-step agent, the production tail and compounding failures make detection infrastructure essential.
How it works
The four dominant failure patterns
Reasoning drift. The agent's plan starts coherent and diverges across turns. Early turns reflect the original intent; later turns pursue a different sub-goal or misapply an earlier constraint. The signal is a session whose final output addresses a different question than the opening turn.
Tool-call failures. A tool returns an error or malformed payload, and the agent either continues without surfacing the failure or retries in a way that compounds the error. Many are silent: the tool returned a 200 with an empty body, and the agent built the rest of the plan around the empty result.
Context-window saturation. Long sessions push earlier context out of the model's effective attention. In the "lost in the middle" pattern, models use information in the middle of a long context less reliably than information near its start or end. The practical result is an agent that forgets a constraint stated five turns ago.
Goal misalignment. The agent completes the literal task as parsed and misses the user's actual intent. This is "specification gaming" in miniature: a sub-agent optimizes its narrow objective at the expense of the global outcome, or the model satisfies the prompt's letter while missing its spirit.
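The silent tool-call case described above (a 200 with an empty body) can be caught with a small guard between the tool and the planner. The following is a minimal sketch, not a definitive implementation: `ToolResult`, `ToolCheck`, `check_tool_result`, and the `required_fields` parameter are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class ToolResult:
    status: int  # transport-level status, e.g. an HTTP code
    body: Any    # parsed payload, or None when nothing came back


@dataclass
class ToolCheck:
    ok: bool
    reason: Optional[str] = None


def check_tool_result(result: ToolResult,
                      required_fields: Sequence[str] = ()) -> ToolCheck:
    """Flag results that look successful at the transport level but are unusable."""
    if not 200 <= result.status < 300:
        return ToolCheck(False, f"status {result.status}")
    # A success status with nothing in the body is the silent-failure case.
    if result.body in (None, "", [], {}):
        return ToolCheck(False, "empty body on success status")
    # For dict payloads, also check the fields the plan depends on (assumed schema).
    if isinstance(result.body, dict):
        missing = [f for f in required_fields if f not in result.body]
        if missing:
            return ToolCheck(False, f"missing fields: {', '.join(missing)}")
    return ToolCheck(True)
```

A failed check is best recorded on the session trace and surfaced to the agent as an explicit error, so the plan does not continue on an empty result.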
Why dev-time tests miss them
Three structural reasons:
- Synthetic inputs do not match the production tail. Engineers write the cases they imagine; users produce the cases nobody anticipated.
- Step-level evaluation hides multi-step failure modes. Unit tests on each tool call all pass; the composition still fails.
- Distribution drifts. What looked like an edge case at launch is the modal input six months later. Test sets that do not refresh against production drift become irrelevant.
Compounding math
Assuming step failures are independent, a workflow with twenty steps at 95% per-step reliability has an end-to-end success rate of 0.95^20, around 36%. At 99% per step, the same workflow lands at about 82%; at 99.5%, around 90%. Reliability compounds multiplicatively, so the per-step bar needed for an end-to-end SLA is much higher than intuition suggests.
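The arithmetic can be reproduced, and inverted to find the per-step bar for a given end-to-end target, with a short sketch. It assumes independent step failures; the function names are illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent step failures."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach an end-to-end target over `steps` steps."""
    return target ** (1.0 / steps)


for p in (0.95, 0.99, 0.995):
    print(f"{p:.1%} per step -> {end_to_end_success(p, 20):.0%} end to end")

needed = required_per_step(0.95, 20)
print(f"95% end to end over 20 steps needs about {needed:.2%} per step")
```

Under the independence assumption, a 95% end-to-end target over twenty steps requires roughly 99.7% per-step reliability.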
Detection: prioritized queues plus managed evaluators
A workable detection pipeline has three components:
- Prioritized annotation queues. Production traces are scored for anomaly signals (unusually long sessions, tool error rates, low evaluator scores, large divergence between the classified intent and the final output) and surfaced for human review in priority order.
- Human-validated failure modes. Reviewers classify each failure against a structured taxonomy (the four patterns above, plus domain-specific subtypes), and the labels become versioned ground truth.
- Regression generation. Labeled failures become test cases in the held-out evaluation set. The next release must pass them. Drift in pass rate on the failure-derived regression set is itself a tracked metric.
The annotation step is the rate-limiting step. It cannot be fully automated without losing the domain judgment that distinguishes a real failure from a quirky-but-correct output. What can be automated is everything around it: surfacing, prioritization, taxonomy, and regression incorporation.
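The surfacing and prioritization step can be sketched as a weighted score over the anomaly signals listed above. This is an illustration under stated assumptions: the `SessionSignals` fields, the weights, and the `LONG_SESSION_TURNS` threshold are hypothetical and would need tuning against labeled data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SessionSignals:
    session_id: str
    turns: int                # session length in turns
    tool_error_rate: float    # fraction of tool calls that errored, 0..1
    evaluator_score: float    # 0..1, higher is better
    intent_divergence: float  # 0..1, gap between classified intent and final output


# Hypothetical weights and threshold, not values from the source.
WEIGHTS = {"length": 0.2, "tool_errors": 0.3, "low_score": 0.3, "divergence": 0.2}
LONG_SESSION_TURNS = 10


def priority(s: SessionSignals) -> float:
    """Combine the anomaly signals into one review priority in 0..1."""
    length = min(s.turns / LONG_SESSION_TURNS, 1.0)
    return (WEIGHTS["length"] * length
            + WEIGHTS["tool_errors"] * min(s.tool_error_rate, 1.0)
            + WEIGHTS["low_score"] * (1.0 - s.evaluator_score)
            + WEIGHTS["divergence"] * s.intent_divergence)


def build_queue(sessions: Iterable[SessionSignals], limit: int) -> List[SessionSignals]:
    """Return the `limit` highest-priority sessions for human review."""
    return sorted(sessions, key=priority, reverse=True)[:limit]
```

The `limit` argument is where reviewer capacity enters: the queue orders the work, but it only surfaces as many sessions as the team can label.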
Why a trace store alone is not enough
A trace store answers "what happened in this session." Detection requires answering "did this session meet the bar." The bar lives in a versioned objective with a managed evaluator, not in a span attribute. Without the objective, every score is decoration; the trace is data without a contract.
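One way to picture the difference is a data structure that ties every score to the objective version that produced it. The sketch below is an assumption about shape, not a description of any particular product: `Objective`, `ScoredSample`, `score_session`, and the toy `refund_completed` evaluator are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Objective:
    name: str
    version: int
    threshold: float                       # minimum score that meets the bar
    evaluator: Callable[[Mapping], float]  # maps a session trace to a score in 0..1


@dataclass(frozen=True)
class ScoredSample:
    session_id: str
    objective: str
    objective_version: int
    score: float
    passed: bool


def score_session(trace: Mapping, objective: Objective) -> ScoredSample:
    """Score one session trace and record which objective version judged it."""
    score = objective.evaluator(trace)
    return ScoredSample(trace["session_id"], objective.name, objective.version,
                        score, score >= objective.threshold)


def refund_completed(trace: Mapping) -> float:
    """Toy evaluator: 1.0 if the trace records an issued refund, else 0.0."""
    return 1.0 if trace.get("refund_issued") else 0.0


objective = Objective("refund_flow_completed", 3, 1.0, refund_completed)
sample = score_session({"session_id": "s-1", "refund_issued": False}, objective)
print(sample)
```

Because the sample carries the objective name and version, a later change to the rubric or threshold does not silently reinterpret old scores.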
The operational loop
Observe (production traces flow in) → Surface (prioritized queues lift anomalies) → Annotate (humans label against taxonomy) → Generate (failures become test cases) → Test (CI runs against the expanded regression set) → Verify (production scoring confirms the fix held).
Each step is versioned: the taxonomy, the rubrics, the evaluators, the regression dataset. A fix verified against version 12 of the regression set carries explicit lineage back to the labeled failures that motivated it.
Example
A team operates a multi-step customer-support agent. Average session length is seven turns and four tool calls. Engineer-written tests cover 180 scripted scenarios; CI is green.
In production, the team observes:
- Refund flows succeed end to end on 67% of sessions, well below the 92% target.
- Tool error rate is 1.8% per call, so a four-call session has roughly a 7% chance of at least one tool failure, even before reasoning issues.
- A sample audit of failed sessions shows roughly 40% reasoning drift, 25% silent tool failures, 20% context saturation, and 15% goal misalignment.
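The figures in this list can be combined into rough per-pattern failure rates. The sketch below uses only the numbers above and rests on two assumptions: tool-call failures are independent, and the audited sample is representative of all failed sessions.

```python
# Figures taken from the example above.
TOOL_ERROR_RATE = 0.018     # per tool call
TOOL_CALLS_PER_SESSION = 4
END_TO_END_SUCCESS = 0.67   # refund flows
AUDIT_SHARES = {
    "reasoning_drift": 0.40,
    "silent_tool_failure": 0.25,
    "context_saturation": 0.20,
    "goal_misalignment": 0.15,
}


def any_tool_failure(rate: float, calls: int) -> float:
    """Chance that at least one of `calls` independent tool calls fails."""
    return 1 - (1 - rate) ** calls


def failures_by_pattern(success_rate: float, shares: dict) -> dict:
    """Fraction of all sessions lost to each pattern, if the audit is representative."""
    failure_rate = 1 - success_rate
    return {pattern: failure_rate * share for pattern, share in shares.items()}


print(f"at least one tool failure: "
      f"{any_tool_failure(TOOL_ERROR_RATE, TOOL_CALLS_PER_SESSION):.1%}")
for pattern, fraction in failures_by_pattern(END_TO_END_SUCCESS, AUDIT_SHARES).items():
    print(f"{pattern}: {fraction:.1%} of all sessions")
```

Under these assumptions, reasoning drift alone accounts for roughly 13% of all refund sessions, which helps rank where remediation effort goes first.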
The remediation:
- A prioritized queue routes long sessions and sessions with tool errors to a human reviewer.
- Reviewers tag each failure against the four-pattern taxonomy and capture the input, the full trace, and the expected behavior.
- Each labeled failure becomes a regression test in the held-out evaluation set.
- A managed evaluator per failure pattern scores production samples nightly; per-pattern drift is alerted independently.
- The next release blocks if pass rate on the failure-derived regression set drops by more than 0.02 (two percentage points).
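The release gate in the last step can be sketched as a comparison of pass rates on the regression set. This is a minimal illustration: the function names are hypothetical, and the 0.02 threshold is read here as an absolute drop in pass rate (a fraction, so two percentage points).

```python
from typing import Sequence

MAX_PASS_RATE_DROP = 0.02  # threshold from the example, read as an absolute drop


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of regression cases that passed."""
    if not results:
        raise ValueError("regression set is empty")
    return sum(results) / len(results)


def release_blocked(baseline: Sequence[bool],
                    candidate: Sequence[bool],
                    max_drop: float = MAX_PASS_RATE_DROP) -> bool:
    """Block when the candidate's pass rate falls more than `max_drop` below baseline."""
    return pass_rate(baseline) - pass_rate(candidate) > max_drop


baseline = [True] * 95 + [False] * 5
candidate = [True] * 90 + [False] * 10
print(release_blocked(baseline, candidate))
```

In CI, `baseline` would come from the last released version and `candidate` from the build under test, both run against the same version of the regression set.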
Three months later, end-to-end success on refund flows is at 89%, the tool error rate has been driven below 0.4% (the failures were authentication retries, not the model), and the agent's reasoning prompts have been narrowed to keep critical constraints in the last 500 tokens of every turn.
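The constraint-placement change can be sketched as a prompt builder that restates hard constraints at the end of every turn, after trimming older history. This is an assumption about how such a fix might look, not the team's actual prompt code: `build_turn_prompt` is a hypothetical helper, and the character budget is a crude stand-in for a token budget.

```python
from typing import List, Sequence


def build_turn_prompt(history: Sequence[str],
                      hard_constraints: Sequence[str],
                      user_message: str,
                      max_history_chars: int = 4000) -> str:
    """Assemble a turn prompt with hard constraints restated last."""
    # Keep the most recent turns that fit the budget (characters as a token proxy).
    kept: List[str] = []
    used = 0
    for turn in reversed(history):
        if used + len(turn) > max_history_chars:
            break
        kept.append(turn)
        used += len(turn)
    kept.reverse()

    constraints = "\n".join(f"- {c}" for c in hard_constraints)
    parts = [
        "\n".join(kept),
        f"User: {user_message}",
        f"Hard constraints (always apply):\n{constraints}",
    ]
    # Drop empty sections, e.g. when no history fits the budget.
    return "\n\n".join(p for p in parts if p)
```

Restating constraints last keeps them out of the middle of the context regardless of how long the session has run.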
Limitations
- Detection coverage is bounded by annotation throughput. The queue can prioritize, but the team has to label. Without budgeted reviewer time, the loop stalls.
- Failure taxonomies need maintenance. The four-pattern decomposition is a starting point; product-specific subtypes will emerge and need to be added without inflating the taxonomy into uselessness.
- Per-step reliability gains have diminishing returns. Pushing from 99.5% to 99.9% per step is far more expensive than from 95% to 99% and may not be the right investment compared with shortening the workflow.
- Compounding math punishes long workflows. Sometimes the right answer is fewer steps with better intermediate verification, not the same workflow with marginally better steps.
- Some failures look like wins on automated metrics. A confidently wrong answer can score well on tone and grounding while still missing user intent. Human spot checks remain the anchor.
Evidence and sources
- "Lost in the Middle: How Language Models Use Long Contexts," Liu et al., 2023, https://arxiv.org/abs/2307.03172, for the empirical case on context-window degradation.
- "Specification gaming: the flip side of AI ingenuity," Krakovna et al., DeepMind, 2020, https://deepmindsafetyresearch.medium.com/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4, for goal-misalignment patterns across reinforcement learning and agent systems.
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," Zheng et al., 2023, https://arxiv.org/abs/2306.05685, for the foundational case on judge agreement that anchors managed-evaluator calibration.
FAQ
Are these failures bugs in the model or bugs in the system? Almost always the system. The model has known limits (attention falloff in long contexts, occasional tool-call schema errors, susceptibility to ambiguous instructions). The agent is the system that has to handle those limits, and the failures cluster where the system did not anticipate the limit.
How do we know when a failure is "real" versus a quirky-but-correct output? By labeling against an explicit rubric tied to user intent, not by model self-assessment. The labeling is the rate-limiting step for exactly this reason; it cannot be skipped without losing the signal.
Is observability enough on its own? No. Observability tells you what happened. Detection requires comparing what happened to a versioned objective. Without the objective, every score is a guess.
Should we just shorten the workflow? Often yes. Compounding math punishes long workflows, and an explicit intermediate verification step (with its own evaluator) is frequently a better investment than another retry layer.
Where does context-window saturation actually bite? Anywhere the constraint stated at turn one matters at turn ten. Compress the running context, place hard constraints close to the model's current attention window, and treat dropped constraints as a separately tracked failure pattern.
Can we automate the annotation step? Partially. An LLM judge can score sessions against a rubric and propose a failure pattern, but the labels still need human spot-check validation to remain trustworthy. The judge accelerates throughput; it does not replace the human anchor.