Updated: 2026-01-23 By: Ari Heljakka
Short answer
Agent observability platforms make three architectural choices that dominate everything else: how they intercept traffic (proxy in front of the model, SDK in the application, or OpenTelemetry collector on the side), how they run evaluators (inline and synchronous, asynchronous and batched, or fully offline), and how they sample traffic (full capture, head-based sampling, or tail-based sampling driven by evaluator scores). Each combination produces a different cost profile, a different latency contract, and a different audit story. The feature lists across vendors look similar; the architectures do not, and the architecture is what bites first in production.
Key facts
- Definition: Architectural tradeoffs in agent observability are the choices a platform makes about interception, evaluation timing, sampling, storage, and gating; each choice is roughly orthogonal but compounds with the others.
- When to use: Whenever a platform decision will outlive any single agent build. Architecture determines lock-in, how far a single failure spreads through the system, and what is queryable later.
- Limitations: Architecture does not paper over bad evaluators or bad rubrics. A clean architecture with no calibration data still produces unreliable scores.
- Example: Two platforms both offer "LLM-as-a-judge on production traffic." One runs the judge inline and adds 800 ms to p95 latency. The other runs the judge asynchronously on a 5 percent sample and exposes drift as a metric. Same feature, different operational shape.
Key takeaways
- Interception model (proxy, SDK, collector) decides whether the platform sees a request before or after your code runs, and how much it can rewrite.
- Evaluation timing (sync, async, offline) decides whether judge calls add to user-facing latency, what the failure mode is when the judge is down, and whether scores can gate responses.
- Sampling strategy (full, head-based, tail-based) decides what is auditable later and how much you pay per million tokens evaluated.
- Storage model (vendor-managed columnar store, customer object store with vendor index, OpenTelemetry-native store-anywhere) decides who can query the data, how long it lives, and whether you can export it without a migration project.
- The architectural choice constrains the philosophy: a platform that captures every span but treats scores as another span attribute cannot easily separate the objective from its implementation.
Definition
An agent observability platform is the system of record for what an autonomous or semi-autonomous LLM application did during a run: which model was called, with which prompt, against which retrieved context, with which tool invocations, and what scores (if any) were attached. The platform's architecture is the set of decisions that govern how that record is created, stored, scored, and queried.
The decisions are not equal in weight. Most user-visible features (a trace viewer, a metrics dashboard, a judge catalogue) sit on top of a small number of foundational choices. Those choices are usually invisible in marketing material and critical in production.
The five that matter most:
- Interception model, how the platform sees traffic.
- Evaluation timing, when judges run relative to the user request.
- Sampling strategy, how much traffic is captured and at what fidelity.
- Storage model, where the data lives and who can query it.
- Gating model, whether scores can block deploys or only inform humans.
The rest of this post takes each in turn.
When this matters
The architectural shape becomes decisive at the moments when the platform fails or when the team needs something the platform was not designed to do:
- The judge model has an outage and the platform is configured to run evaluators inline. User-facing latency spikes, then user-facing errors.
- A regulator asks for the rubric version and judge version that produced the score on a flagged response from three months ago. The platform stored the score as a span attribute and the rubric as YAML in someone's repo.
- The team wants to switch orchestration frameworks. The platform's SDK only instruments the old one cleanly.
- Cost grows non-linearly with traffic because every span runs through an inline judge.
- An incident reveals that head-based sampling discarded exactly the traffic shape that caused the incident.
If none of those scenarios is plausible, the architectural choice is not yet decisive. If any is, it dominates.
How it works
Interception model
Three patterns:
- Proxy in front of the model. The platform sits between the application and the model provider. Every model call passes through it, which makes capture automatic and rewrite (caching, redaction, routing) possible. The price is a new component in the critical path: another network hop, another failure mode, another set of keys.
- SDK in the application. The platform ships a library imported by the application; the library wraps model and tool calls. Capture is precise (the application controls what is wrapped) but framework-coupled: each orchestration library wants its own integration shape.
- OpenTelemetry collector on the side. The application emits OTLP spans following the GenAI semantic conventions. A collector receives them, the platform ingests from the collector. Capture is vendor-neutral and outlives any single platform, at the cost of more upfront instrumentation and a less opinionated default experience.
A platform built on a proxy assumes it controls the request. A platform built on an SDK assumes it controls the runtime. A platform built on OTLP assumes it controls neither and reads whatever the instrumentation produced. Each assumption shows up later in the contract.
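To make the SDK pattern concrete, here is a minimal sketch of a wrapper that records one span-like record per model call. The names `traced_model_call`, `CAPTURED_SPANS`, and `call_model` are hypothetical, the in-memory list stands in for a real exporter, and the attribute keys are borrowed from the OpenTelemetry GenAI semantic conventions as an assumption about what a platform would ingest.

```python
import functools
import time
from typing import Any, Callable

# Hypothetical in-memory sink; a real SDK would hand spans to an exporter.
CAPTURED_SPANS: list[dict[str, Any]] = []


def traced_model_call(model: str) -> Callable:
    """Wrap a model-calling function and record one span-like dict per call."""

    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(fn)
        def wrapper(prompt: str, **kwargs: Any) -> str:
            start = time.monotonic()
            span: dict[str, Any] = {
                "name": f"chat {model}",
                "attributes": {
                    "gen_ai.operation.name": "chat",
                    "gen_ai.request.model": model,
                },
            }
            try:
                return fn(prompt, **kwargs)
            except Exception as exc:
                # Record the failure on the span, then let it propagate.
                span["attributes"]["error.type"] = type(exc).__name__
                raise
            finally:
                span["duration_s"] = time.monotonic() - start
                CAPTURED_SPANS.append(span)

        return wrapper

    return decorator


@traced_model_call(model="example-model")
def call_model(prompt: str) -> str:
    # Stand-in for a real provider call.
    return prompt.upper()
```

The application decides what gets wrapped, which is where both the precision and the framework coupling come from: every call site or orchestration library that should be visible needs its own wrapper.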
Evaluation timing
Three patterns:
- Synchronous and inline. The evaluator runs in the request path, before the response is returned. Scores can gate the response (refuse, rewrite, retry). The cost is added latency and a new failure mode: if the judge is slow or down, user requests are slow or down.
- Asynchronous and post-hoc. The evaluator runs after the response, against the captured trace, on a queue. Latency contract is preserved; scores become a metric and a flag rather than a gate. Loss of the judge means missing scores, not missing responses.
- Offline and curated. The evaluator runs against a fixed evaluation set on a schedule (nightly, per-deploy). Production traces are not scored directly; the curated set is the system of record. This is the cheapest and most reproducible, and the furthest from production.
A platform that promises both sync and async usually has to pick one as primary. The sync path is harder to make reliable; the async path is harder to make actionable.
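The difference between the sync and async paths can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's runtime: `respond_sync`, `respond_async`, `drain`, and the `Judge` type are hypothetical names, the judge is any callable returning a score in [0, 1], and failing open on a judge timeout is one policy choice among several.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional

Judge = Callable[[str], float]  # assumed to return a score in [0, 1]

_executor = ThreadPoolExecutor(max_workers=4)
score_queue: "queue.Queue[str]" = queue.Queue()


def respond_sync(response: str, judge: Judge, threshold: float,
                 timeout_s: float) -> tuple[str, Optional[float]]:
    """Inline path: the judge runs before the response is returned."""
    future = _executor.submit(judge, response)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Fail open (a policy choice): a slow judge yields an unscored
        # response instead of a failed request.
        return response, None
    if score < threshold:
        return "Sorry, I can't help with that.", score
    return response, score


def respond_async(response: str) -> str:
    """Post-hoc path: return immediately and queue the response for scoring."""
    score_queue.put(response)
    return response


def drain(judge: Judge) -> list[tuple[str, float]]:
    """Worker step: score everything queued so far."""
    scored = []
    while not score_queue.empty():
        item = score_queue.get()
        scored.append((item, judge(item)))
    return scored
```

In the sync path the timeout is the latency contract; in the async path a judge outage only delays `drain`, so scores go missing while responses do not.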
Sampling strategy
Three patterns:
- Full capture. Every span is stored. Cost grows linearly with traffic. Audit is easy: anything that happened can be reconstructed.
- Head-based sampling. A decision at the start of the trace determines whether the whole trace is captured. Cost is bounded, but the sampled population may not include the rare failures.
- Tail-based sampling. Spans are buffered, then a decision after the trace finishes determines whether to keep it (based on error, latency, evaluator score). Cost is bounded and the kept population is biased toward interesting traces, at the cost of a buffer and a delay.
For agent observability specifically, tail-based sampling driven by evaluator score is the pattern that pays off most often: keep every trace where any dimension scored below threshold, sample the rest. It requires the evaluator to run early enough to inform the sampling decision.
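A minimal sketch of that keep-or-drop decision follows. The `FinishedTrace` shape, the `keep_trace` function, and its parameters are hypothetical; the sketch assumes the trace has already been buffered to completion and scored.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FinishedTrace:
    trace_id: str
    had_error: bool
    scores: dict[str, float] = field(default_factory=dict)  # dimension -> score


def keep_trace(trace: FinishedTrace, threshold: float, base_rate: float,
               rng: random.Random) -> bool:
    """Tail-based decision, made only after the trace has finished."""
    if trace.had_error:
        return True
    # Keep every trace where any dimension scored below the threshold.
    if any(score < threshold for score in trace.scores.values()):
        return True
    # Otherwise sample the remaining traces at the base rate.
    return rng.random() < base_rate
```

Passing the random generator in explicitly keeps the decision reproducible in tests. Note that a trace with no scores at all falls through to the base rate, which is why the evaluator has to run before the sampling decision for the rule to bias toward interesting traces.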
Storage model
Three patterns:
- Vendor-managed columnar store. Spans land in the platform's database. Query is fast and the UI is tightly integrated, but data export is a project and retention is bounded by the contract.
- Customer-controlled object store with vendor index. Raw spans live in customer-controlled storage (S3, GCS, customer cloud). The vendor indexes and serves them. Cheaper retention, slower queries.
- OpenTelemetry-native, store anywhere. The platform reads OTLP, and the customer chooses the store. The platform is a UI and an evaluator runtime; the data is portable by default.
The storage choice decides who owns the data in the legal sense, and how hard a migration is in the engineering sense.
Gating model
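Two patterns at deploy time (gating an individual response inline is the synchronous case covered under evaluation timing):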
- Scores inform, humans gate. Evaluator scores show up in dashboards and alerts. A human decides whether a regression is worth blocking a deploy.
- Scores gate, humans inspect. Scores feed a CI step or a deployment hook that blocks a release on regression. Humans inspect when the gate trips.
A platform that supports both still has a default, and the default shapes how teams use it. The gating model is the operational concern that surfaces the difference between observability and evaluation most clearly: observability without gates is a dashboard, not a contract.
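The "scores gate" pattern can be sketched as a small CI step that compares a candidate build's scores with a stored baseline. The function names, the per-dimension score dictionaries, and the 0.02 tolerance are all illustrative assumptions.

```python
import sys


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float) -> list[str]:
    """Return dimensions whose candidate score dropped by more than tolerance."""
    failed = []
    for dimension, base_score in baseline.items():
        new_score = candidate.get(dimension)
        # A missing dimension counts as a regression: the gate cannot verify it.
        if new_score is None or base_score - new_score > tolerance:
            failed.append(dimension)
    return sorted(failed)


def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 to allow the release, 1 to block it."""
    failed = regressions(baseline, candidate, tolerance)
    for dimension in failed:
        print(f"regression on {dimension}", file=sys.stderr)
    return 1 if failed else 0
```

A CI job would call `sys.exit(gate(baseline, candidate))`, so the release is blocked by the exit code and a human only looks when the gate trips.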
Example
A team running a customer-support agent considers two platforms with similar feature lists.
Platform A:
- Interception: proxy in front of the model provider.
- Evaluation: synchronous, inline, can block responses.
- Sampling: full capture.
- Storage: vendor columnar store, 30-day retention by default.
- Gating: scores can block deploys via a webhook.
Platform B:
- Interception: OTLP collector, ingests GenAI spans.
- Evaluation: asynchronous, runs against a sampled fraction of production traces and a curated set per deploy.
- Sampling: tail-based, keep all traces where any score is below threshold, sample the rest at 5 percent.
- Storage: customer object store, indexed by the platform.
- Gating: evaluator scores against the curated set feed a CI gate; production scores feed alerts.
Both can claim "agent observability with LLM-as-a-judge." The operational shape is different:
- Platform A pays a latency tax on every request and a cost tax on every span. In exchange, it can refuse or rewrite a bad response before it reaches the user. The audit story is complete: every span is stored, every score is attached.
- Platform B preserves the user-facing latency contract and bounds cost, at the price of not gating responses inline. The audit story is good for the traces it kept and weak for the passing traces it discarded, which is 95 percent of everything that scored at or above threshold. The CI gate against the curated set is the deployment contract.
Neither is correct in the abstract. The choice depends on which failure mode the team can least tolerate.
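How much Platform B actually retains follows from simple arithmetic on its sampling rule. In the sketch below, the function name is hypothetical and the 2 percent below-threshold rate is an assumed figure chosen purely for illustration; only the 5 percent base rate comes from the example.

```python
def kept_fraction(below_threshold_rate: float, base_rate: float) -> float:
    """Expected share of traces retained under score-driven tail sampling."""
    # All below-threshold traces are kept; the rest are sampled at base_rate.
    return below_threshold_rate + (1.0 - below_threshold_rate) * base_rate


# Assumed: 2 percent of traces score below threshold; 5 percent base rate.
share = kept_fraction(below_threshold_rate=0.02, base_rate=0.05)
print(f"{share:.1%} of traces retained")
```

Under those assumptions roughly 6.9 percent of traces are stored, compared with 100 percent for Platform A's full capture, and the retained share rises as the evaluator flags more traffic.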
Comparison
A category-level view of the five architectural axes:
| Axis | Option A | Option B | Option C |
|---|---|---|---|
| Interception | Proxy in front of the model. | SDK in the application. | OpenTelemetry collector on the side. |
| Evaluation timing | Synchronous and inline. | Asynchronous and post-hoc. | Offline and curated. |
| Sampling | Full capture. | Head-based sampling. | Tail-based sampling on score or error. |
| Storage | Vendor columnar store. | Customer object store with vendor index. | Vendor-neutral, store anywhere. |
| Gating | Scores inform, humans gate. | Scores gate CI, humans inspect. | Scores gate responses inline. |
Three patterns generalize across rows:
- Capture is a budget. Anything captured at full fidelity costs storage; anything not captured is unavailable to evaluators and unavailable to audit. The right point on the curve depends on the cost of missing a trace versus the cost of storing one.
- Evaluation timing is a latency contract. Sync gating buys the strongest control and the worst tail latency. Async post-hoc preserves latency and weakens the gate to a signal. Offline curated decouples evaluation from production at the price of being a step removed from reality.
- Lock-in lives in storage and interception. A platform that owns the proxy and the store is fastest to integrate and hardest to leave. A platform that reads OTLP and indexes customer-controlled storage is slower to integrate and survives a vendor change.
Who should not use a hosted eval-first platform
A team whose dominant question is "what happened in this run" rather than "did the run meet our objectives" is paying for evaluation infrastructure it will not use. A hosted eval-first platform shines when there are versioned rubrics, calibration datasets, and deployment gates. Without those artifacts, the value collapses to a trace viewer that costs more than a trace viewer should.
Where each category is stronger
- Proxy plus synchronous evaluation plus full capture is strongest when the operational contract is "no bad response reaches a user" and the cost of an extra hop in the critical path is acceptable.
- OTLP plus asynchronous evaluation plus tail-based sampling is strongest when the operational contract is "preserve latency, gate deploys on objectives, retain audit-quality traces of anything interesting."
- SDK plus offline evaluation against curated sets is strongest when the team's iteration loop is "change the prompt, run the eval suite, ship if it passes," and production scoring is a later concern.
Limitations
- Architectural elegance does not produce calibration. A platform with the cleanest interception model and the most flexible evaluator runtime can still produce judges that disagree with humans 30 percent of the time. The architecture must be paired with a calibration loop against a versioned ground truth dataset.
- Lock-in is not always avoidable. OTLP plus customer-controlled storage is the most portable shape, and also the slowest to get value from. Teams that need value next week usually accept some lock-in.
- Tail-based sampling has gaps. Anything kept is auditable; anything sampled out is gone. If a regulator asks for traces from a specific window, the sample is what you have.
- Inline evaluation is a soft real-time problem. A judge with a p99 latency of three seconds imposes a latency contract that has to be defended, including on the day the judge model is degraded.
- Scores attached to spans hide rubric versions. Without a first-class object for the rubric and the evaluator, two scores with the same name may come from two different judges and not be comparable across time.
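One way to picture a first-class score object is a record that carries its rubric and judge versions alongside the value. The `ScoreRecord` fields and the `comparable` helper below are a hypothetical shape, not any platform's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    trace_id: str
    dimension: str       # the score's name, e.g. a rubric dimension
    value: float
    rubric_version: str
    judge_version: str


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Two scores are comparable only if name, rubric, and judge all match."""
    return (a.dimension, a.rubric_version, a.judge_version) == (
        b.dimension, b.rubric_version, b.judge_version)
```

With this shape, a trend line or an audit query can refuse to mix scores produced under different rubric or judge versions instead of silently averaging them.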
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, the vendor-neutral wire format that OTLP-based interception assumes.
- "Sampling at Scale: Tail-Based Sampling for Distributed Traces," OpenTelemetry SIG, https://opentelemetry.io/docs/concepts/sampling/, for the tradeoffs between head-based and tail-based sampling that carry over to agent traces.
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," Zheng et al., 2023, https://arxiv.org/abs/2306.05685, on why judge calibration is itself a first-class operational concern.
FAQ
Is a proxy interception model always slower than an SDK? Usually, marginally. A proxy adds a network hop measured in single-digit milliseconds on a well-tuned platform, alongside a new failure mode and a new dependency in the request path. SDKs avoid that hop at the cost of coupling to the orchestration framework instead.
If I run evaluators asynchronously, can I still block bad responses? Not for the response in question, no. You can block downstream actions (a follow-up email, an automated decision) on a score that has since arrived. You can flag the trace for review and use it to gate the next deploy. The pattern is "score the response after it ships, gate the next ship."
How much should I sample? Enough that the rare failure modes are still represented. For low-frequency, high-impact errors (a regulated response goes off-policy), tail-based sampling on evaluator score is the right shape. For high-frequency, low-impact metrics (latency distribution), head-based sampling at a small percentage is enough.
Does it matter whether storage is vendor-managed or customer-controlled? It matters for retention, cost, and exit. Vendor-managed is fast to start and bounded by contract. Customer-controlled is slower to start and survives a vendor change. The judgment is how long the platform needs to outlive the current contract.
Can the platform's architecture compensate for bad rubrics? No. Architecture decides where evaluation runs and how scores flow. The score's meaning is set by the rubric and the calibration data, neither of which the architecture produces. A clean architecture with bad rubrics produces fast, well-organized, wrong answers.
Does this matter for a single agent with low traffic? Less so. At low traffic, full capture and synchronous evaluation are affordable, the cost of any architecture choice is small, and lock-in costs accrue slowly. The architectural tradeoffs become decisive at the scale where capture, evaluation, and storage cost compound.