How to Implement LLM Observability Systems

Updated: 2026-03-08 By: Ari Heljakka

Short answer

LLM observability is the practice of capturing every step of an LLM-powered request as structured traces and spans, rolling those traces up into metrics (latency, cost, error rate, quality scores), and wiring the output into continuous evaluation loops where judges score the outputs and score drift becomes the signal that drives action. It is the AI-native extension of classical observability, with three things added: trace fidelity for non-deterministic outputs, semantic span attributes for prompts and model responses, and evaluation signals from first-class evaluators (judges) that score whether the output was correct and track whether quality is degrading, not just whether the output was returned.

Key facts

  • Definition: End-to-end visibility into the behaviour of an LLM application built from traces (the full path of a request), spans (each step inside the request), metrics (aggregated numbers over time), and continuous evaluation loops where judges score traces and drift alerts trigger action (trace → score → alert → iterate).
  • When to use: Any production LLM feature with chained prompts, retrieval, tool calls, or multi-model routing; any system where wrong answers are more expensive than slow answers.
  • Limitations: Observability only works with evaluators wired into the loop (traces alone do not score correctness), does not fix non-determinism on its own, and can quickly become a cost line item without a sampling strategy.
  • Example: A retrieval-augmented chat endpoint emits a single trace per user message containing spans for the retriever, the planner LLM call, the tool call, the responder LLM call, and a sampled evaluator (judge) span that scores faithfulness and answer relevance; the scores become metrics that alert on drift from baseline quality.

Key takeaways

  • LLM observability is more than logging. It is a structured trace per request, plus rolled-up metrics, plus continuous evaluation loops where judges score traces and score drift becomes an operational signal.
  • Monitoring tells you that something broke. Observability lets you ask why, after the fact, without re-deploying. Continuous evaluation lets you see quality degrading in real time before customers complain.
  • OpenTelemetry plus a semantic convention for LLM spans gives you a vendor-neutral foundation; the GenAI semantic conventions are the right starting point. Evaluators (judges) running on sampled traces are what turns quality from a feeling into an actionable signal.
  • Quality signals from evaluators are first-class telemetry, not post-hoc analysis. Evaluator agreement, score drift, and regression detection are the operational signals that drive reliability.

Definition

LLM observability is end-to-end visibility into an LLM-powered system, structured so that any production output can be reconstructed, explained, and scored. The core insight: traces are the raw input that evaluation pipelines consume; operational success is decomposed into measurable dimensions (faithfulness, relevance, safety, correctness); evaluators (judges) are first-class scoring components that run continuously and measure those dimensions; and score drift in any dimension becomes the signal that drives action.

It combines four ingredients:

  1. Agentic traces. A single trace represents one request through the system, from the inbound API call to the final response. Trace IDs stitch everything together. Traces are the data structure that feeds evaluators.
  2. Spans. Each step inside a trace is a span: a retrieval call, a vector search, a prompt-rendering function, an LLM completion, a tool invocation, a guardrail check, an evaluator (judge) call. Spans nest to form the request's causal tree.
  3. Metrics. Spans roll up into time-series metrics: p50 / p95 latency, error rate, tokens in / out, cost per request. Quality metrics from evaluators (faithfulness score, relevance score, safety score by route and prompt version) are first-class metrics on the same footing as latency. Each quality metric is a normalized score between 0 and 1, which allows direct comparison and optimization across quality dimensions.
  4. Continuous evaluation loops. Evaluators (judges) run on a sampled fraction of traces and attach scores to outputs. Those scores roll up into metrics. Score drift from baseline (detected via moving averages, standard deviation thresholds, or evaluator disagreement) triggers alerts that drive iteration. The loop: trace → sample → score → alert → reproduce → fix → regression test.

The difference between LLM observability and traditional observability is the kind of failure each is built to surface, and the system needed to detect it. Traditional observability is tuned for "the service is down or slow." LLM observability adds a class of failures that does not exist outside AI systems: the request succeeded, the latency was fine, and the output was wrong. Detecting this class of failure requires evaluators (judges) wired into a continuous loop; that loop is the core of LLM observability.

Why this matters

Four properties of LLM applications make observability critical rather than optional.

Non-deterministic outputs

The same prompt rarely produces the same response. Sampling temperature, model updates, retrieval changes, and even minor prompt edits can shift behaviour invisibly. Without traces that capture inputs, intermediate state, and outputs, the only path to debugging is re-running the request, which itself produces a different answer.

Chained pipelines

A modern LLM feature is rarely a single completion. It is a retriever, a planner LLM, one or more tool calls, a final responder, and one or more guardrails. Any of these steps can be the failure point. Traces with nested spans make the chain visible; flat logs do not.

Hallucinations and accuracy issues

A model can confidently produce plausible nonsense. The classical 200-OK is silent here. Success criteria for an LLM system are rarely captured by HTTP status alone; they must be decomposed into measurable dimensions: faithfulness, relevance, groundedness, tool-call correctness. Those quality signals need to live next to the trace with normalized scores, or hallucinations only surface in user complaints.

Cost and performance unpredictability

Token usage varies with input, context, and model behaviour. A small prompt change can triple cost without anyone catching it before the invoice arrives. A retrieval bug can balloon context. Cost and latency need to be attributed back to specific routes, prompt versions, and model choices; otherwise the line item grows faster than the explanation.

LLM observability vs LLM monitoring

The two terms get used interchangeably; they are not the same.

  • Monitoring answers "is something wrong right now?" It is metric-led, threshold-driven, and tuned for known failure modes: error rate above X, latency above Y, daily cost above Z.
  • Observability answers "why did this specific request behave this way?" It is trace-led, exploratory, and tuned for unknown failure modes you have not seen before.

A healthy stack has both. Monitoring fires alerts. Observability lets the on-call engineer or the product owner walk back through the failing trace, read the actual prompt that went to the model, see the retrieved chunks, and identify what changed.

Core components

Traces and spans

A trace is the full record of one request. Spans are the steps inside it. Each span carries:

  • A name that identifies the step.
  • A start time and duration.
  • Parent and child relationships.
  • Attributes: model name, prompt version, token counts, retrieval query, top-k chunks, tool arguments, response text, and any error.
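
As a rough sketch of the span shape listed above, the following plain-Python data model carries a name, a start time and duration, a parent link, and free-form attributes, then renders the causal tree. The class name Span, the helper render_tree, the span names, and the attribute keys are illustrative assumptions; they are not the OpenTelemetry API and not the GenAI convention names.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One step inside a trace (hypothetical shape, not the OpenTelemetry API)."""
    span_id: str
    name: str
    start_ms: float
    duration_ms: float
    parent_id: Optional[str] = None
    attributes: dict[str, Any] = field(default_factory=dict)


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, with children listed under their parent."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Siblings are ordered by start time so the output reads like a waterfall.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Illustrative values only.
spans = [
    Span("a", "handle_request", 0, 900),
    Span("b", "retrieve", 5, 120, parent_id="a", attributes={"top_k": 4}),
    Span("c", "llm_completion", 130, 700, parent_id="a",
         attributes={"model": "example-model", "prompt_version": "v1",
                     "input_tokens": 812, "output_tokens": 96}),
]
print("\n".join(render_tree(spans)))
```

Keeping the parent link on each span, rather than nesting objects, mirrors how spans usually arrive at a backend: as a flat stream that is reassembled into a tree by ID.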

OpenTelemetry's GenAI semantic conventions standardise the attribute names for LLM spans, so a trace produced in one framework can be read by any compliant backend. This matters more than it sounds; without it, every framework invents its own field names and traces stop being portable.

Instrumentation

Instrumentation is what produces the spans in the first place. Three approaches, in increasing order of effort:

  1. Auto-instrumentation via SDKs that patch popular libraries (the OpenAI client, LangChain, LlamaIndex, common tracing wrappers). One line of setup, full coverage of the common path.
  2. Manual spans for application code that the SDK cannot see: business logic, custom retrieval, custom tools.
  3. Semantic enrichment. Add attributes that make traces searchable later: user-tier, route name, prompt hash, feature flag, experiment ID. Search and filtering only work on the attributes you remember to add.
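
Manual spans and semantic enrichment can be sketched without any tracing library. The helper below is a hypothetical stand-in built on the standard library; in an OpenTelemetry setup the tracer's start_as_current_span context manager would normally play this role. The names manual_span, FINISHED_SPANS, and rerank, and the attribute keys, are assumptions made for illustration.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Collected span records; a real setup would export these to a backend instead.
FINISHED_SPANS: list[dict] = []
_current_span_id = contextvars.ContextVar("current_span_id", default=None)


@contextmanager
def manual_span(name: str, **attributes):
    """Record a span around a block of application code (hypothetical helper)."""
    span_id = uuid.uuid4().hex
    record = {
        "span_id": span_id,
        "parent_id": _current_span_id.get(),  # whichever span is currently open
        "name": name,
        "attributes": dict(attributes),
        "error": None,
    }
    token = _current_span_id.set(span_id)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_span_id.reset(token)
        FINISHED_SPANS.append(record)


def rerank(chunks: list[str]) -> list[str]:
    """Stand-in for custom retrieval logic an auto-instrumenting SDK cannot see."""
    return sorted(chunks, key=len)


# Semantic enrichment: route, user tier, and prompt hash make the trace searchable.
with manual_span("handle_question", route="/support", user_tier="pro"):
    with manual_span("custom_rerank", prompt_hash="abc123") as child:
        child["attributes"]["chunk_count"] = len(rerank(["bb", "a", "ccc"]))
```

Inner spans finish first, so FINISHED_SPANS holds the child before its parent; the parent_id field is what lets a backend rebuild the nesting.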

Metrics

Metrics are the rolled-up view. Useful ones to track from day one:

  • Latency: p50, p95, p99, broken out by route and by model.
  • Throughput: requests per minute, tokens per minute.
  • Errors: rate by type (rate limit, schema validation failure, tool error, guardrail block).
  • Cost: total spend and cost per request, by model and by route.
  • Quality: judge scores (faithfulness, relevance, helpfulness, safety), broken out by route and by prompt version.
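
The operational metrics in this list can be rolled up from per-request records with the standard library alone. The sketch below assumes a hypothetical record shape (route, latency_ms, cost_usd, error) and made-up values; a real backend would compute the same aggregates over exported spans.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical per-request records, e.g. derived from the root span of each trace.
REQUESTS = [
    {"route": "/support", "latency_ms": 100, "cost_usd": 0.01, "error": None},
    {"route": "/support", "latency_ms": 200, "cost_usd": 0.01, "error": None},
    {"route": "/support", "latency_ms": 300, "cost_usd": 0.02, "error": "rate_limit"},
    {"route": "/support", "latency_ms": 400, "cost_usd": 0.01, "error": None},
    {"route": "/support", "latency_ms": 500, "cost_usd": 0.02, "error": None},
    {"route": "/billing", "latency_ms": 250, "cost_usd": 0.03, "error": None},
    {"route": "/billing", "latency_ms": 350, "cost_usd": 0.03, "error": "tool_error"},
]


def roll_up(records: list[dict]) -> dict[str, dict]:
    """Aggregate latency percentiles, error rate, and cost per request by route."""
    by_route: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_route[record["route"]].append(record)

    summary = {}
    for route, rows in by_route.items():
        latencies = [row["latency_ms"] for row in rows]
        # 99 cut points: index 49 is p50, index 94 is p95. Needs at least 2 samples.
        cuts = quantiles(latencies, n=100, method="inclusive")
        summary[route] = {
            "requests": len(rows),
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
            "error_rate": sum(row["error"] is not None for row in rows) / len(rows),
            "cost_per_request_usd": sum(row["cost_usd"] for row in rows) / len(rows),
        }
    return summary


for route, stats in roll_up(REQUESTS).items():
    print(route, stats)
```

Judge scores can be averaged per route and prompt version in the same loop once evaluators attach them to the records.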

Quality is the metric class that most teams add last and miss most. Without it, the dashboard says "everything is fine" right up to the moment a customer escalates a hallucination.

Evaluators (judges) as continuous quality gates

Evaluators close the loop. They are first-class scoring components that run on a sample of traces (or on every trace, in low-volume systems) and attach a score to each output. Each evaluator measures a single dimension of success (faithfulness, relevance, safety, correctness) independently, so you can see which dimension is degrading. The score becomes a metric you can chart, alert on, and slice by route and prompt version. Critically, evaluator outputs are not post-hoc analysis; they are part of the trace itself and feed into real-time score monitoring. When baseline quality scores drift, that drift is the signal that should trigger investigation and rollback, just like an error spike or latency increase.
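
To make the shape of a single-dimension evaluator concrete, here is a deliberately simple sketch. It uses token overlap as a toy stand-in for a faithfulness judge; a production judge would more likely be an LLM or a trained model. The function names, the trace dictionary shape, and the example texts are assumptions.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; numbers are kept so '14' and '90' differ."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def faithfulness_score(answer: str, context_chunks: list[str]) -> float:
    """Toy judge: share of the answer's content tokens found in the retrieved context."""
    content = {t for t in _tokens(answer) if len(t) > 3 or t.isdigit()}
    if not content:
        return 0.0  # nothing checkable; treat as unsupported rather than perfect
    supported = content & _tokens(" ".join(context_chunks))
    return len(supported) / len(content)


def attach_score(trace: dict, dimension: str, score: float) -> dict:
    """Store a normalized 0-1 score on the trace under its dimension name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("scores must be normalized to the 0-1 range")
    trace.setdefault("scores", {})[dimension] = score
    return trace


context = ["Refunds are issued within 14 days of purchase."]
grounded = {"answer": "Refunds are issued within 14 days."}
ungrounded = {"answer": "Refunds are issued within 90 days by carrier pigeon."}

for trace in (grounded, ungrounded):
    attach_score(trace, "faithfulness", faithfulness_score(trace["answer"], context))
    print(trace["scores"])
```

Keeping one function per dimension means a drop in one score points at one kind of failure, which is the property that matters more than the scoring method itself.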

For deeper coverage of the evaluation half of the loop, see How to Build Eval-Driven AI Observability for Agents.

How it works in practice

A workable implementation has three repeating steps.

1. Instrument the application

Pick an OpenTelemetry-based tracer. Wire up auto-instrumentation for the LLM client and the orchestration framework you use. Add manual spans around any custom code. Set service-name, environment, and route attributes so the traces filter cleanly.

Verify by running a single request end-to-end and reading the resulting trace in your backend. The trace should show, in order, the inbound request, every model call, every tool call, and the final response. If a step is missing, instrument it before moving on.

2. Visualize traces and roll up metrics

Pipe traces to a backend that can show waterfall views (one span per row, time on the x-axis) and roll spans up into metrics dashboards. The waterfall view is the daily-driver for debugging a single bad output; the metrics view is the daily-driver for spotting trends.

Set baseline alerts: error rate, p95 latency, daily cost. Add quality alerts as soon as evaluators are wired up: judge score drop over N standard deviations from baseline, by route.
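
One way to read "judge score drop over N standard deviations from baseline" is sketched below. It assumes the baseline is a list of daily mean scores for one route and one dimension, and it adds a small minimum-drop floor so a very stable baseline does not alert on noise. The function name, the parameters, and the sample values are illustrative assumptions.

```python
from statistics import mean, stdev


def drift_alert(baseline_daily_means: list[float], current_mean: float,
                n_sigma: float = 3.0, min_drop: float = 0.01) -> dict:
    """Flag a drop in the current mean score relative to the baseline spread."""
    base = mean(baseline_daily_means)
    spread = stdev(baseline_daily_means)  # needs at least two baseline points
    drop = base - current_mean            # positive when quality got worse
    threshold = max(n_sigma * spread, min_drop)
    return {
        "baseline": base,
        "drop": drop,
        "threshold": threshold,
        "alert": drop > threshold,
    }


# Hypothetical daily mean relevance scores for one route over the past week.
baseline = [0.80, 0.81, 0.79, 0.80, 0.82, 0.78, 0.80]

print(drift_alert(baseline, current_mean=0.71))  # large drop
print(drift_alert(baseline, current_mean=0.79))  # within normal variation
```

Running the check per route keeps a regression on one surface from being averaged away by healthy traffic elsewhere.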

3. Close the evaluation-driven feedback loop

Traces and metrics are inputs to action, not the action itself. The loop is continuous evaluation with score drift as the trigger:

  1. An evaluator runs on sampled traces and scores outputs along independent dimensions (faithfulness, relevance, safety, correctness). Each evaluator is model-agnostic and measures its dimension consistently across different prompt versions, model changes, and retrieval strategies.
  2. Score drifts from baseline (moving average drops, or agreement between independent evaluators falls).
  3. An alert fires: "Faithfulness score down 0.09 from baseline on /billing route."
  4. Drill into the underlying traces. Filter by route, by model, by prompt version, by user segment, by evaluator confidence.
  5. Identify the changed dimension. New prompt version? New model? New retrieval chunking? Increased user load?
  6. Reproduce on a curated dataset. Make the change. Re-evaluate with the same judges. Ship if scores recover, roll back if they do not.
  7. Add the regression case to the eval suite so this failure is caught automatically next time.

Without steps 5–7, observability becomes a wall of dashboards that nobody looks at. With continuous evaluation loops and score drift monitoring, traces and scores drive iteration directly. Evaluator agreement (does judge A agree with judge B on this trace?) is a secondary signal that surfaces when judges need retraining. This is separation of concerns in action: traces show what happened, scores quantify whether it was correct along each measurable dimension, and independent evaluators measuring those dimensions surface regressions.
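
The "ship if scores recover, roll back if they do not" decision in step 6 can be expressed as a small gate over per-dimension mean scores from the curated dataset. This is a sketch under assumptions: the function name, the tolerance value, and the example scores are hypothetical.

```python
def release_decision(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> tuple[str, list[str]]:
    """Compare candidate scores to baseline, dimension by dimension.

    A dimension regresses when the candidate falls more than `tolerance`
    below baseline. A dimension missing from the candidate counts as 0.0,
    so an evaluator that silently stopped running also blocks the release.
    """
    regressed = sorted(
        dimension
        for dimension, base_score in baseline.items()
        if candidate.get(dimension, 0.0) < base_score - tolerance
    )
    return ("roll back" if regressed else "ship", regressed)


# Hypothetical mean judge scores on the same curated dataset.
baseline_scores = {"faithfulness": 0.91, "relevance": 0.80}
broken_prompt = {"faithfulness": 0.90, "relevance": 0.71}
fixed_prompt = {"faithfulness": 0.91, "relevance": 0.81}

print(release_decision(baseline_scores, broken_prompt))
print(release_decision(baseline_scores, fixed_prompt))
```

Returning the list of regressed dimensions, not just a verdict, tells the reviewer where to start reading traces.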

Example: a retrieval-augmented support assistant

A concrete observability setup for a support assistant that answers questions about a documentation corpus.

Trace per user message:
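
As an illustration only, a trace for one user message might be outlined as follows. The span list follows the retrieval-augmented example under Key facts (retriever, planner LLM call, tool call, responder LLM call, sampled judge); the span names, attribute keys, and values here are hypothetical placeholders rather than output from a real system.

```python
# Illustrative trace for a single user message; every name and value is made up.
trace = {
    "name": "support_message",
    "children": [
        {"name": "retriever", "attributes": {"top_k": 5}},
        {"name": "planner_llm", "attributes": {"input_tokens": 640, "output_tokens": 48}},
        {"name": "tool_call", "attributes": {"tool": "lookup_docs"}},
        {"name": "responder_llm", "attributes": {"input_tokens": 2100, "output_tokens": 180}},
        {"name": "judge", "attributes": {"faithfulness": 0.9, "answer_relevance": 0.8}},
    ],
}


def outline(node: dict, depth: int = 0) -> list[str]:
    """Flatten a nested trace into indented lines, one per span."""
    attrs = ", ".join(f"{key}={value}" for key, value in node.get("attributes", {}).items())
    line = f"{'  ' * depth}{node['name']}" + (f" [{attrs}]" if attrs else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(trace)))
```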

Metrics rolled up across a day:

  • p95 end-to-end latency: 1.18 s.
  • Error rate: 0.9% (mostly rate-limit errors).
  • Cost per request: $0.014, trending up 12% week over week (responder context size grew).
  • Faithfulness score (from judge): 0.89 mean across sampled traces, baseline 0.91 (drift alert fired: -0.02).
  • Answer-relevance score (from judge): 0.71 mean (baseline 0.80), with a sharp dip on one route (drift alert fired: -0.09, highest severity).

Action triggered by the score drift:

The relevance-score drift alert fires automatically (drop of 0.09 exceeds the threshold). The dip is filtered down to traces from the last six hours. The retriever spans show three irrelevant chunks consistently entering the context. A prompt-version filter shows the score dip starts with v18, which changed the retrieval query template. The team rolls back to v17 on that route; the judge immediately starts re-scoring incoming traces, and the mean score climbs back to 0.80 within 30 minutes. The bad version becomes a regression test in the eval suite, tied to the judge's faithfulness and relevance scorers. The regression test runs against every new prompt version before deployment.

That sequence (drift alert → trace filtering → root cause → rollback → immediate validation → regression test) is the whole point of observability with continuous evaluation. Without traces and judges, the bug shows up as user complaints two days later. With score drift monitoring, the fix ships the same day.

What to look for in an LLM observability stack

A short, practical checklist.

  • OpenTelemetry-native. Standard span format, standard semantic conventions, no proprietary SDK lock-in.
  • GenAI semantic conventions. Specifically the standardised LLM-related attribute names. Portable traces start here.
  • Full prompt and response capture. With redaction controls. A trace you cannot read is a trace you cannot debug.
  • Multi-model and multi-framework support. OpenAI, Anthropic, Google, Bedrock, Azure, plus the orchestration framework your team actually uses.
  • Evaluation integration. Evaluators (judges) as first-class components, each measuring a single dimension of success. The ability to attach normalized scores (0-1) from an external evaluator service back onto the trace, roll them up into metrics, and alert on score drift in any dimension with the same intensity you alert on latency.
  • Search across any dimension. Filter by prompt version, route, user segment, judge score, model, token range.
  • Cost attribution. Per-request and per-route cost, broken out by model.
  • Alerting. On latency, error rate, cost, and quality, with sensible defaults.
  • Score drift detection. Continuous monitoring for shifts in evaluator scores (faithfulness, relevance, safety, correctness). Score drift is the primary signal for silent quality regressions. Includes baseline tracking, moving-window comparisons, and evaluator-agreement metrics to detect when judges disagree and need retraining.
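
Two of the drift-detection pieces in the last item, moving-window comparison and evaluator agreement, can be sketched in a few lines. The class name, the pass threshold of 0.5, the window size, and the sample scores are assumptions chosen for illustration.

```python
from collections import deque


class MovingWindowDrift:
    """Compare the mean of the most recent scores against a fixed baseline."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores: deque[float] = deque(maxlen=window)

    def add(self, score: float) -> bool:
        """Record a score; return True when the full window has drifted too far."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge drift
        window_mean = sum(self.scores) / len(self.scores)
        return self.baseline - window_mean > self.max_drop


def agreement_rate(judge_a: list[float], judge_b: list[float],
                   pass_threshold: float = 0.5) -> float:
    """Share of traces where both judges land on the same side of the threshold."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two equally sized, non-empty score lists")
    same = sum((a >= pass_threshold) == (b >= pass_threshold)
               for a, b in zip(judge_a, judge_b))
    return same / len(judge_a)


monitor = MovingWindowDrift(baseline=0.80, window=3, max_drop=0.05)
print([monitor.add(score) for score in (0.8, 0.8, 0.8, 0.6)])
print(agreement_rate([0.9, 0.2, 0.7, 0.4], [0.8, 0.6, 0.9, 0.1]))
```

A falling agreement rate alongside a score drop suggests checking the judges first; a score drop with stable agreement points at the system.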

Limitations

LLM observability earns its keep, but it is not a silver bullet.

  • Traces alone do not score quality. A trace shows what happened; it does not judge correctness. Success criteria must be decomposed into measurable dimensions, and each dimension requires an independent evaluator. The fix is evaluators (judges) wired into a continuous loop, prompt versioning, and disciplined iteration based on score drift in each dimension.
  • Evaluators are the foundation. Without judges scoring sampled traces, "quality" remains a vibe and regressions hide until users complain. Evaluators are not optional; they are what makes reliability measurable. Each evaluator measures a specific dimension (faithfulness, relevance, safety, correctness) independently, and together they calibrate the system's understanding of whether a particular output meets its operational objectives.
  • Evaluator training and agreement matter. Judges need baselines and thresholds. When score drift fires an alert, the first question is whether the scores are correct (evaluator drift) or the system is broken (actual quality drift). Evaluator disagreement metrics help surface when judges are drifting out of sync.
  • Cost compounds. Full trace capture plus sampled judge calls plus long-term retention can rival inference cost if untuned. Sampling strategy and evaluator efficiency are not optional at scale.
  • PII and prompt content are sensitive. Trace contents include user inputs and model outputs. Redaction, retention windows, and access controls need to be designed in, not bolted on.
  • Drift alerts are only useful if they drive action. An alert without a clear path to reproduction, fix, and regression testing becomes noise. Wire score drift alerts to on-call rotations and review queues. Pair each drift alert with a link to the filtered traces that caused it.
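
On the PII point, redaction is easiest to enforce at the moment attributes are written to a span. The sketch below assumes two simple patterns (email addresses and 13 to 16 digit runs such as card numbers) and hypothetical attribute keys; real redaction rules depend on the data and the regulations that apply, and these patterns are not exhaustive.

```python
import re

# Illustrative patterns only; they will miss many kinds of sensitive data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_NUMBER = re.compile(r"\b\d{13,16}\b")


def redact(text: str) -> str:
    """Replace email addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


def redact_attributes(attributes: dict, text_keys: tuple[str, ...]) -> dict:
    """Return a copy of span attributes with the named text fields redacted."""
    return {
        key: redact(value) if key in text_keys and isinstance(value, str) else value
        for key, value in attributes.items()
    }


span_attributes = {
    "prompt": "Card 4111111111111111, mail a.b@example.com",
    "model": "example-model",
    "input_tokens": 42,
}
print(redact_attributes(span_attributes, text_keys=("prompt", "response")))
```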

FAQ

What is the difference between LLM observability and LLM monitoring? Monitoring is metric-led and tuned for known failure modes: it alerts when latency, error rate, or cost crosses a threshold. Observability is trace-led and tuned for unknown failure modes: it lets you reconstruct any single request, read the actual prompt and response, and explain why it behaved the way it did.

Do I need OpenTelemetry, or can I use a proprietary SDK? A proprietary SDK is faster to ship and slower to migrate off. OpenTelemetry with the GenAI semantic conventions keeps traces portable across backends. For anything beyond a prototype, the OTel path is the safer default.

What is the minimum useful instrumentation? One span per LLM call, one span per tool call, one span per retrieval call, parent-child relationships, and the model name, prompt version, and token counts on each span. Everything else is enrichment.

How does evaluation fit into observability? Evaluators (judges) are not post-hoc analysis; they are first-class components of the observability stack. They run on sampled traces and attach scores to outputs, and those scores roll up into metrics you can chart, alert on, and slice. Each evaluator measures one dimension of success independently, and the set of evaluators together decomposes your operational objectives into measurable signals. Critically, score drift in any dimension becomes an operational signal on the same footing as latency or error rate. Without evaluators, the dashboard cannot tell you whether the answer was correct, only whether it was returned. With evaluators wired into a continuous loop, you catch quality regressions in real time and drive iteration via score drift alerts.

How much traffic should I sample for judge evaluations? Start at 5 to 10 percent of production traffic, biased toward suspected failure modes and high-value routes. Increase sampling on critical surfaces, decrease on stable low-risk paths. Sampling is a cost lever, not a fixed setting.
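
A sampling decision that is deterministic per trace and adjustable per route can be sketched with a hash of the trace ID. The rate table, the route names, and the function name are hypothetical; the default rate is taken from the low end of the 5 to 10 percent starting range above.

```python
import hashlib

# Hypothetical per-route rates: heavier sampling on a high-value route.
SAMPLE_RATES = {"/billing": 0.25, "default": 0.05}


def should_evaluate(trace_id: str, route: str, rates: dict[str, float] = SAMPLE_RATES) -> bool:
    """Decide whether a trace is sent to the judges.

    Hashing the trace ID gives the same answer every time for the same trace,
    so retries and replays do not change which traces get scored.
    """
    rate = rates.get(route, rates["default"])
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


sampled = sum(should_evaluate(f"trace-{i}", "/billing") for i in range(10_000))
print(f"sampled {sampled} of 10000 /billing traces")
```

Changing a value in the rate table is then the cost lever: raise it on a route under investigation, lower it once the route is stable.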

Can observability detect hallucinations on its own? No. A trace shows what happened; it does not judge correctness. Hallucination detection requires either deterministic checks (citation matching, schema validation) or LLM-as-judge faithfulness scoring on top of the trace.
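
A deterministic citation-matching check of the kind mentioned above might look like the following sketch. It assumes answers cite sources with bracketed IDs such as [doc1] and that the retriever span records the IDs of the chunks it returned; both conventions, and the function name, are assumptions.

```python
import re


def citation_check(answer: str, retrieved_ids: set[str]) -> dict:
    """Verify that every bracketed citation refers to a chunk that was retrieved."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    unknown = sorted(cited - retrieved_ids)
    return {
        "cited": sorted(cited),
        "unknown": unknown,
        # An answer with no citations at all also fails the check.
        "passed": bool(cited) and not unknown,
    }


retrieved = {"doc1", "doc2"}
print(citation_check("Refunds take 14 days [doc1].", retrieved))
print(citation_check("Refunds take 90 days [doc9].", retrieved))
```

A check like this only proves that a cited source exists; whether the source actually supports the claim still needs a faithfulness judge.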

What is score drift, and why does it matter? Score drift is a shift over time in the distribution of evaluator scores (faithfulness, relevance, safety, correctness), the normalized metrics that measure your operational objectives. It is the primary signal for catching silent quality regressions caused by model updates, prompt changes, retrieval bugs, or shifting user behaviour. Drift detection via moving averages, statistical thresholds, or evaluator-agreement metrics lets teams catch those regressions hours or days before customers complain. When a score drift alert fires, it should trigger the same response as a latency spike or error spike: investigate, reproduce, fix, and add a regression test. Score drift in any dimension is a reliability incident, not a post-hoc measurement artifact.