Updated: 2026-01-28 By: Ari Heljakka
Short answer
Debugging an AI agent in production is fundamentally different from debugging a single LLM call. Failures live at the session level, not the request level: a tool returned empty data three steps earlier, a bad fact that entered the context window early corrupts every later decision without anyone noticing, and the visible failure is far from the root cause. The workflow that holds up under real traffic uses four primitives (full session trace reconstruction, semantic failure clustering, multi-turn simulation, and production-to-evaluation pipelines) connected by versioned evaluators that score traces on orthogonal dimensions. Every failure that does not become a regression test is a failure waiting to recur.
Key facts
- Definition: Production agent debugging is the practice of capturing complete session traces, clustering failures by semantic pattern, simulating multi-turn scenarios for non-determinism, and converting observed failures into permanent regression cases scored by versioned evaluators.
- When to use: Any multi-turn agent in production, especially tool-using agents, retrieval agents, or agents whose outputs feed downstream automation.
- Limitations: Requires upstream investment in trace fidelity, schema discipline, and a ground-truth dataset; non-deterministic execution paths mean reproduction is statistical, not exact.
- Example: A support agent fabricates a refund policy; the trace reveals a retrieval span that returned zero documents at turn 3, the cluster groups it with eleven similar policy-fabrication failures, and a multi-turn simulation reproduces the failure in 14 of 20 runs before the fix.
Key takeaways
- Agent failures are session-level. Request-level dashboards hide the root cause and surface only the visible symptom.
- A retrieval that returns zero results is technically successful and is one of the most common silent failure modes.
- Issue clustering by semantic pattern (not by error code) turns a flat list of bad traces into a triage queue ordered by impact.
- Multi-turn simulation absorbs non-determinism: run a scenario 20 or more times and read the pass rate, not a single result.
- Every reproduced production failure should land in the ground-truth dataset as a permanent regression case, scored by a versioned evaluator on the dimension that broke.
Definition
Production agent debugging is the operational workflow for finding, reproducing, and preventing failures in agentic systems. It rests on the recognition that an agent is not a function call: it is a state machine over multiple model calls, tool calls, and intermediate plans, where the bug at step 7 is usually caused by something that happened at step 2.
A "debugger" for this world is not a stepper; it is a pipeline. Traces feed clustering; clusters feed a candidate fix; the fix runs against a multi-turn simulation; the failure becomes a versioned evaluation case that gates every future change. That loop is what makes production agent debugging tractable instead of vibes-based.
When this matters
The debugging primitives below earn their keep when the agent has any of these properties:
- Multi-turn state that persists across model calls, with the next decision depending on prior context.
- Tool use where external APIs return data the agent must interpret.
- Non-deterministic execution paths where the same input takes different code paths across runs.
- Domain-specific correctness where success is defined by an expert, not by a return code.
If your agent answers a single prompt with a single output and never calls a tool, request-level debugging is enough. The moment any of those four properties enters the picture, request-level debugging stops scaling.
How it works: failure modes and debugging primitives
Production agent debugging breaks into two layers: a recognition layer (the five session-level failure modes that recur) and a workflow layer (the four primitives that handle them). Both are below.
The five session-level failure modes
Failures in production agents cluster into five recurring categories. Each has a different root cause and a different fix; conflating them produces the wrong remediation.
1. Multi-turn state corruption
Incorrect information enters the context window early (a wrong user fact, a stale retrieval, an ambiguous tool result) and silently shapes every later decision. No single step looks wrong; the conversation as a whole drifts.
Root-cause work: trace back from the visible failure to the earliest step where the context first contained the corrupting fact. The bug is usually in the step that wrote it, not the step that surfaced it.
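The trace-back can be made mechanical once each step records a snapshot of the context it produced. The sketch below is a minimal illustration under that assumption; `Step`, its fields, and `first_step_containing` are hypothetical names, and plain substring matching stands in for whatever fact-matching a real system would need.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    index: int     # position of the step within the session
    kind: str      # hypothetical labels: "user", "tool", "model"
    context: str   # snapshot of the context window after this step ran


def first_step_containing(steps: list[Step], needle: str) -> Optional[Step]:
    """Return the earliest step whose context already contains the suspect fact."""
    for step in sorted(steps, key=lambda s: s.index):
        if needle in step.context:
            return step
    return None


session = [
    Step(1, "user", "User asks about plan limits."),
    Step(2, "tool", "User asks about plan limits. Tool: plan=legacy"),
    Step(3, "model", "User asks about plan limits. Tool: plan=legacy. Draft answer."),
]

origin = first_step_containing(session, "plan=legacy")
print(origin.index if origin else "not found")
```

The step returned is the one that wrote the fact into the context, which is where the investigation should start, even though the failure only becomes visible later.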
2. Tool-use failures
External APIs return empty or malformed data and the agent treats a technically successful call as a useful result. A retrieval that returns zero results is technically successful and produces the most dangerous outputs: the agent fabricates content to fill the gap.
Root-cause work: instrument tool spans with both the request and the structured response. Add semantic assertions on the response shape (non-empty, expected schema, plausible value range) before the response is handed back to the planner.
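A minimal sketch of such assertions, assuming the retrieval tool returns a dictionary with a `documents` list whose entries carry a relevance `score` between 0 and 1. That response shape and the function name `check_retrieval_response` are assumptions made for illustration, not the contract of any particular tool.

```python
from typing import Any


def check_retrieval_response(response: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the response looks usable."""
    problems: list[str] = []
    docs = response.get("documents")
    if not isinstance(docs, list):
        problems.append("schema: 'documents' missing or not a list")
        return problems
    if len(docs) == 0:
        # A 200 response with zero hits is the silent failure described above.
        problems.append("empty: retrieval returned zero documents")
    for i, doc in enumerate(docs):
        if not isinstance(doc, dict):
            problems.append(f"schema: document {i} is not an object")
            continue
        score = doc.get("score")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            problems.append(f"range: document {i} has implausible score {score!r}")
    return problems


print(check_retrieval_response({"status": 200, "documents": []}))
```

Returning a list of problems instead of raising lets the caller attach the findings to the tool span and decide separately whether the planner should see the response at all.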
3. Non-deterministic decision paths
Identical inputs produce different execution sequences across runs. Reproduction is probabilistic; a single test pass means nothing.
Root-cause work: simulation, not single-run reproduction. Run the candidate scenario many times and measure the pass rate.
4. Error propagation
Small inaccuracies compound across steps. By step 10 the agent is operating on a fact that was 92 percent right at step 1 and 51 percent right by step 5.
Root-cause work: per-step confidence scoring and early-step evaluators. Catching the drift at step 2 is cheaper than catching it at step 10.
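The compounding effect can be illustrated with a simple model that assumes every step preserves a fact independently with the same probability. That independence assumption is a simplification introduced here for illustration; real agents rarely degrade this cleanly, and the figures quoted above need not follow this model.

```python
def compounded_accuracy(per_step: float, steps: int) -> float:
    """Probability that a fact is still right after `steps` independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    return per_step ** steps


# Show how quickly a seemingly high per-step accuracy erodes.
for n in (1, 2, 5, 8, 10):
    print(n, round(compounded_accuracy(0.92, n), 3))
```

Under this model a per-step accuracy of 0.92 falls to roughly 0.51 after eight steps, which is why an evaluator placed at step 2 catches drift while it is still small.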
5. Evaluation misalignment
Pre-deployment evaluations pass while production fails. The test suite measures the wrong dimensions, the wrong distribution, or only the easy slice of the input space.
Root-cause work: pull failing production traces into the calibration dataset; rerun the evaluator over the expanded dataset; track the judge's agreement with human labels as its own metric.
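Judge agreement can be tracked with two small functions: raw agreement and Cohen's kappa, which corrects raw agreement for chance. The sketch assumes binary pass/fail verdicts from the judge and from a human labeler on the same traces; the function names and the sample labels are illustrative.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of cases where the judge and the human label match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


judge_labels = [True, True, False, True, False, True, True, False]
human_labels = [True, False, False, True, False, True, True, True]
print(agreement_rate(judge_labels, human_labels))
print(round(cohen_kappa(judge_labels, human_labels), 3))
```

Raw agreement can look healthy when almost every case passes; kappa is the more conservative number to watch when the label distribution is skewed.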
The four debugging primitives
The workflow that handles all five failure modes rests on four primitives. Each is necessary; none is sufficient alone.
Primitive 1, full session trace reconstruction
Every model call, tool call, retry, and intermediate plan is captured as a span and stitched into a single trace per session. Tool results are first-class events with the full structured response, not free-text summaries. The trace is the artifact you reach for the moment a score drops or a user complains.
A trace that omits the tool response, the retry attempts, or the planner's intermediate output forces every debugging session to start with guesswork. Invest in trace fidelity upfront; under-instrumented traces are the silent tax on every later incident.
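A minimal sketch of the stitching step, assuming spans arrive as flat records that carry a session identifier and a start timestamp. The `Span` type and its fields are hypothetical and deliberately simpler than a real tracing schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    session_id: str
    start_ms: int
    kind: str                                   # "model", "tool", "retry", "plan"
    payload: dict[str, Any] = field(default_factory=dict)  # full structured data


def stitch_sessions(spans: list[Span]) -> dict[str, list[Span]]:
    """Group spans by session and order each session chronologically."""
    traces: dict[str, list[Span]] = defaultdict(list)
    for span in spans:
        traces[span.session_id].append(span)
    for session_spans in traces.values():
        session_spans.sort(key=lambda s: s.start_ms)
    return dict(traces)


raw = [
    Span("s1", 30, "model", {"output": "draft answer"}),
    Span("s2", 5, "model", {"output": "greeting"}),
    Span("s1", 10, "plan", {"steps": ["search", "answer"]}),
    Span("s1", 20, "tool", {"request": {"q": "refund"}, "response": {"documents": []}}),
]
for span in stitch_sessions(raw)["s1"]:
    print(span.start_ms, span.kind)
```

Keeping the tool request and the structured response in the payload is what later makes an empty retrieval visible in the reconstructed trace.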
Primitive 2, semantic failure clustering
A list of 400 failing traces is not actionable. A list of seven failure clusters (each with a short semantic label and a representative trace) is. Clustering groups failures by pattern (similar tool call sequences, similar prompt structures, similar error shapes), so the team can prioritize by impact instead of by recency.
Cluster on the dimensions that map to root causes: tool call sequence, retrieval-empty events, judge-flagged dimensions (faithfulness drop, tone drift), and version (prompt, model, tool catalog).
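As a starting point, failures can be grouped by an exact signature built from those dimensions and ranked by size. This is a deliberately simple stand-in for semantic clustering (it groups identical keys, not similar ones), and `FailureSignature` with its fields is a hypothetical shape chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureSignature:
    tool_sequence: tuple[str, ...]   # ordered tool names called in the session
    retrieval_empty: bool            # whether any retrieval returned zero results
    flagged_dimension: str           # dimension the judge flagged
    prompt_version: str              # version of the prompt in use


def rank_clusters(failures: list[FailureSignature]) -> list[tuple[FailureSignature, int]]:
    """Group identical signatures and order the groups by size, largest first."""
    return Counter(failures).most_common()


fabrication = FailureSignature(("search_kb",), True, "faithfulness", "v12")
tone = FailureSignature(("search_kb", "send_email"), False, "tone", "v12")
failures = [fabrication] * 3 + [tone]

for signature, count in rank_clusters(failures):
    print(count, signature.flagged_dimension, signature.retrieval_empty)
```

An embedding-based approach would merge near-duplicates that this exact-match grouping keeps apart, but the ranked output has the same form: a short triage queue ordered by impact.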
Primitive 3, multi-turn simulation
Non-determinism breaks single-run reproduction. The fix is to run each candidate scenario 20 or more times with behavioral assertions, and to read the pass rate as the regression signal. A scenario that passes 18 of 20 runs is qualitatively different from one that passes 4 of 20.
Simulation runs are also where new failure modes surface. A scenario that passes 14 of 20 reveals a tail of behavior the original eval set never captured.
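The core of a simulation harness is small: run a scenario many times and report the pass rate. In the sketch below, `pass_rate` is a hypothetical helper and `flaky_agent_scenario` is a random stand-in for a real multi-turn run with behavioral assertions.

```python
import random
from typing import Callable


def pass_rate(scenario: Callable[[random.Random], bool], runs: int = 20, seed: int = 0) -> float:
    """Run the scenario `runs` times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    rng = random.Random(seed)  # seeded so the harness itself is repeatable
    passes = sum(1 for _ in range(runs) if scenario(rng))
    return passes / runs


def flaky_agent_scenario(rng: random.Random) -> bool:
    """Stand-in for a real agent run; passes with probability 0.7."""
    return rng.random() < 0.7


print(pass_rate(flaky_agent_scenario, runs=20))
```

In a real harness the scenario would drive the agent through a scripted conversation and return whether the behavioral assertions held; the rate, not any single run, is what gets compared before and after a fix.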
Primitive 4, production-to-evaluation pipelines
Every reproduced production failure should become a permanent regression case in the ground-truth dataset, scored by the same versioned evaluator that runs in CI/CD. The pipeline has four steps:
- Trace selected from production (by alert, cluster, or user report).
- Ground-truth label added by a domain expert (the correctness target).
- Case added to the versioned dataset with the affected dimension tagged.
- Evaluator rerun across the dataset; pass-rate floor enforced as a deploy gate.
This is the loop that turns one-off firefighting into operational quality. A platform that does not let production traces flow back into the calibration dataset forces this pipeline to be rebuilt by hand.
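The final step, the pass-rate floor, can be sketched as a per-dimension gate. The dimension names, the floors, and the `gate` function are illustrative assumptions; each result is taken to be a pair of the tagged dimension and whether the evaluator passed the case.

```python
from collections import defaultdict


def gate(results: list[tuple[str, bool]], floors: dict[str, float]) -> dict[str, bool]:
    """Map each gated dimension to True if its pass rate meets the floor."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for dimension, passed in results:
        totals[dimension] += 1
        passes[dimension] += int(passed)
    verdict = {}
    for dimension, floor in floors.items():
        total = totals[dimension]
        # A gated dimension with no cases fails: no evidence is not a pass.
        rate = passes[dimension] / total if total else 0.0
        verdict[dimension] = rate >= floor
    return verdict


results = [("faithfulness", True)] * 19 + [("faithfulness", False)]
results += [("tone", True)] * 3 + [("tone", False)]
verdict = gate(results, {"faithfulness": 0.9, "tone": 0.8})
print(verdict, "deploy allowed:", all(verdict.values()))
```

Gating per dimension, not on a single aggregate score, keeps a regression in one dimension from being hidden by strong results in another.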
Example: a fabricated refund policy
A support agent confidently quotes a refund policy that does not exist. A user complains.
- Trace reconstruction. The agent's session trace shows seven turns. Turn 3 fired a retrieval tool call that returned zero documents. The agent's next turn synthesized a "30-day refund window" from background priors. The retrieval span was technically successful (HTTP 200, zero hits).
- Clustering. The trace is grouped with eleven similar fabrications, all sharing the same empty-retrieval pattern. The cluster is large enough to prioritize.
- Simulation. A scenario template ("user asks about an unfamiliar policy area") runs 20 times against the live agent. It produces a hallucinated policy in 14 of 20 runs.
- Production-to-eval. Twelve representative traces become regression cases tagged with the affected dimensions. The evaluator's per-dimension floor for the merge gate is raised.
- Fix. A planner-level assertion is added: empty retrieval triggers an explicit "I do not know" response. The scenario re-runs: 19 of 20 now respond correctly. The evaluator on the regression set is now green; the merge gate passes; the cluster shrinks in subsequent days.
The bug was at turn 3, two turns before the visible failure, and the fix lives in the planner, not in the model. Without trace-level reconstruction and clustering, the team would have spent the week tweaking the model prompt to no effect.
Limitations
- Trace fidelity costs upfront. Full structured tool responses, retry events, and planner outputs are not free to capture; instrumentation, redaction, and storage are real engineering work.
- Simulation is statistical. A 20-run sample is more meaningful than a 1-run reproduction, but it is still a sample. Critical paths benefit from larger runs.
- Clustering depends on the dimension set. Bad cluster axes produce bad clusters. The clustering work itself is an ongoing calibration exercise.
- Ground-truth labeling is the rate-limiter. Domain-expert time is the bottleneck for the production-to-evaluation pipeline. Prioritize traces by cluster impact, not by recency.
- Evaluators drift. Judge models change, prompts change, and a once-calibrated judge can slip without obvious symptoms. Treat judge agreement with human labels as its own continuously monitored metric.
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the trace-and-span model that lets session reconstruction work.
- Anthropic, Test and evaluate, https://docs.anthropic.com/en/docs/test-and-evaluate/, for the rubric-and-judge model behind versioned evaluators.
- Microsoft, AI agent debugging patterns, https://learn.microsoft.com/en-us/azure/ai-services/agents/, for the multi-turn state and tool-call shapes that drive the five failure modes.
FAQ
Why is request-level debugging not enough for agents? Because the bug at turn 7 is usually caused by something at turn 2. Request-level dashboards show the symptom and hide the cause. Session-level traces stitch the calls together so the root cause is visible.
How do I cluster failures if I do not have a clustering tool? Start with three axes: tool call sequence, version (prompt, model, tools), and the judge-flagged dimension. A spreadsheet or a notebook with embeddings over the trace summary is enough to start; the value is in the axes you pick, not the tool.
How many simulation runs are enough? Twenty is a useful default for spot-checks. Hundreds are warranted for critical paths. The exact number is a cost-versus-confidence trade-off; the wrong number is one.
How do I prevent production-to-eval pipelines from polluting the dataset? Treat the dataset as versioned. Every added case is signed off by a domain expert; the addition itself is a code change; the calibration of judges against the new dataset is re-run before the dataset is promoted.
Do I need an LLM-as-judge to debug? Not for every step. Deterministic checks (schema validation, exact match, non-empty assertions on tool results) cover a large share of the surface. LLM judges become essential for open-ended dimensions: faithfulness, helpfulness, tone.