Updated: 2026-01-10 By: Ari Heljakka
Short answer
Agent observability platforms cluster into four archetypes. Eval-first platforms organize around versioned objectives and managed evaluators. Framework-coupled platforms organize around the agent SDK that emits the traces. Open-source tracing libraries organize around the raw span. Workbench-style tools organize around the experiment as the unit of work. The right composition depends on which abstraction owns your deployment gate and which is acceptable to assemble from glue. Vendor names sit inside these archetypes; the archetype is what dictates how the platform behaves at scale.
Key facts
- Definition: Agent observability is the practice of capturing agent runs (sessions, turns, tool calls) and scoring them against success criteria so that production behavior is queryable, gateable, and auditable.
- When to use: Any agent deployed beyond a prototype; the input surface and the long tail of edge cases expand quickly, and silent failures compound.
- Limitations: No archetype covers every concern natively; production stacks routinely compose two or three.
- Example: A regulated multi-step agent might use an eval-first platform to own the deployment gate, an open-source tracing layer for raw span capture, and a separate annotation surface for human review.
Key takeaways
- The archetype dictates which question the platform answers natively and which it answers with glue.
- Eval-first platforms own deployment gates, audit lineage, and per-objective drift alerts.
- Framework-coupled platforms accelerate teams committed to one SDK and bind them as the surface grows.
- Open-source tracing is a substrate, not a deployment gate; expect to build the eval layer or compose with one.
- Workbench-style tools excel at experiment iteration and lag on continuous production scoring.
- Versioned rubrics, managed evaluators, and per-dimension scorecards are the discipline; the platform is the artifact that makes the discipline cheap.
Definition
An agent observability platform captures agent runs and exposes them for querying, scoring, debugging, and alerting. The architectural variation is in what the platform treats as primary:
- A versioned objective with managed evaluators (eval-first).
- A framework session as the agent SDK defines it (framework-coupled).
- A raw span with no opinion about scoring (open-source tracing).
- An experiment with controlled inputs, candidate prompts, and scored outputs (workbench-style).
Surface features (trace viewer, evaluator catalogue, dashboard, alerts, experiment runs) can exist in any archetype. The difference is which abstraction is foundational and which is bolted on.
When this matters
The choice becomes decisive when one of these is true:
- Deployments must be gated by per-dimension score regressions, not just unit tests.
- Compliance or regulator review requires explicit lineage from a decision back to a versioned rubric.
- The same objective is enforced by multiple implementations (rule, judge, human) and must remain portable across them.
- Drift on a specific dimension must trigger alerts independent of trace volume.
- The team expects to swap models, prompts, or frameworks while keeping the evaluation framework constant.
If none holds, lightweight tracing with sampled evaluators is plenty. If any holds, the archetype dictates whether the work is natural or a constant fight against the tool.
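The drift condition in the list above can be made concrete. The following is a minimal sketch, assuming scores are already collected per dimension for a baseline window and a recent window; the function names, the dimension names, and the 0.05 tolerance are all hypothetical. Because it compares mean scores rather than counts, a traffic spike alone does not trigger it.

```python
from statistics import mean


def dimension_drift(baseline, recent, tolerance=0.05):
    """True when the recent mean score sits more than `tolerance` below the baseline mean."""
    if not baseline or not recent:
        return False  # too little data to call it drift
    return mean(baseline) - mean(recent) > tolerance


def drifting_dimensions(baseline_scores, recent_scores, tolerance=0.05):
    """Names of dimensions whose mean score dropped, regardless of how many traces arrived."""
    return sorted(
        dim
        for dim, base in baseline_scores.items()
        if dimension_drift(base, recent_scores.get(dim, []), tolerance)
    )


# Hypothetical per-dimension scores (0 to 1) from two time windows.
baseline = {"groundedness": [0.90, 0.92, 0.88], "tone": [0.80, 0.82]}
recent = {"groundedness": [0.70, 0.75], "tone": [0.81, 0.79]}

alerts = drifting_dimensions(baseline, recent)
print(alerts)  # ['groundedness']
```

A production version would likely use a statistical test and a minimum sample size; the comparison of means here is only meant to show the shape of a per-dimension alert.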
How it works
Archetype 1: eval-first
Primary abstraction: versioned objective. Evaluators are managed components with a pinned model, prompt, threshold, and ground-truth dataset. Scores range from 0 to 1 across orthogonal dimensions; aggregate scorecards are composed from versioned weights. Traces are the inputs that evaluators score, not the product.
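As an illustration of that data model, here is a small sketch, not any vendor's schema. Every class, field, and identifier below (`EvaluatorVersion`, `judge-model-a`, `gold-2025-12`, and so on) is an assumption made for the example. Each score carries the pinned evaluator that produced it, and the aggregate is a weighted mean over dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorVersion:
    """Everything pinned for one managed evaluator (all field names are illustrative)."""
    objective: str
    version: str
    judge_model: str
    prompt_id: str
    threshold: float
    dataset_id: str


@dataclass(frozen=True)
class Score:
    evaluator: EvaluatorVersion  # lineage: which pinned evaluator produced this value
    value: float

    def __post_init__(self):
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"score {self.value} is outside [0, 1]")


def aggregate(scores, weights):
    """Weighted mean over dimensions; `weights` maps objective name to weight."""
    total = sum(weights[s.evaluator.objective] for s in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[s.evaluator.objective] * s.value for s in scores) / total


# Hypothetical pinned evaluators and a hypothetical weight set.
grounded = EvaluatorVersion("groundedness", "v3", "judge-model-a", "prompt-17", 0.8, "gold-2025-12")
tone = EvaluatorVersion("tone", "v1", "judge-model-a", "prompt-04", 0.6, "gold-2025-12")
weights_v2 = {"groundedness": 0.6, "tone": 0.4}  # the weight set is itself versioned

scores = [Score(grounded, 0.9), Score(tone, 0.5)]
overall = aggregate(scores, weights_v2)  # 0.6 * 0.9 + 0.4 * 0.5 = 0.74
```

Keeping the evaluator version on the score record, rather than in a side table, is one way to make the lineage question answerable without a join.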
What is native:
- Rubric and judge versioning with explicit lineage.
- CI gating on a per-dimension scorecard.
- Calibration against human labels as an ongoing metric.
- Per-objective drift alerts independent of span volume.
- Model and framework portability: the same scorecard runs against any underlying implementation.
What is secondary:
- Raw span exploration as the primary surface, if the team wants APM-style trace browsing.
- Zero-setup coverage before any objective is written.
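A per-dimension CI gate could look roughly like the sketch below. It assumes the pipeline already has mean scores per dimension for the current baseline and for the candidate build; the function name, the dimension names, and the 0.02 regression budget are hypothetical.

```python
def gate(baseline, candidate, max_regression=0.02):
    """Return one message per dimension that should block the deployment."""
    failures = []
    for dim, base in baseline.items():
        cand = candidate.get(dim)
        if cand is None:
            failures.append(f"{dim}: missing in candidate")
        elif base - cand > max_regression:
            failures.append(f"{dim}: {base:.2f} -> {cand:.2f}")
    return failures


# Hypothetical scorecards: the candidate's average is higher, yet one dimension regressed.
baseline_card = {"groundedness": 0.90, "safety": 0.99, "tone": 0.80}
candidate_card = {"groundedness": 0.91, "safety": 0.93, "tone": 0.88}

failures = gate(baseline_card, candidate_card)
print(failures)  # ['safety: 0.99 -> 0.93']
```

In a CI job the caller would exit non-zero when `failures` is non-empty. The example is chosen so that an aggregate-only gate would pass the candidate, which is the case per-dimension gating exists to catch.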
Archetype 2: framework-coupled
Primary abstraction: session as the agent SDK defines it. Instrumentation is automatic; turns, tool calls, sub-agents, and state are emitted because the framework emits them. Evaluators are usually built-in and tied to the framework's data model.
What is native:
- Zero-glue setup for teams committed to the framework.
- Native handling of the framework's agent primitives (planner, executor, sub-agent, memory).
- Quick wins on replay and debugging within the framework's world.
What is awkward:
- Portability when the team adds a second framework.
- First-class evaluator versioning when the framework prioritizes the agent abstraction over the scoring abstraction.
- Coverage of the same objective across multiple implementations (rule, judge, human review).
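The last point, one objective enforced by several implementations, can be sketched independently of any framework. Everything here is illustrative: the `Objective` and `Result` types, the digit-run rule, and the human-label lookup are assumptions for the example, and an LLM judge would simply be a third callable with the same signature.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    name: str
    version: str


@dataclass(frozen=True)
class Result:
    objective: Objective
    implementation: str  # e.g. "rule", "judge", or "human"
    score: float


def rule_check(output):
    """Rule implementation: fail when a run of eight or more digits appears."""
    return 0.0 if re.search(r"\d{8,}", output) else 1.0


def evaluate(objective, implementations, output):
    """Score one output with every implementation registered for the objective."""
    return [Result(objective, name, fn(output)) for name, fn in implementations.items()]


# Hypothetical objective, human labels, and agent output.
no_account_leak = Objective("no_account_number_leak", "v2")
human_labels = {"Your balance is available in the app.": 1.0}
implementations = {"rule": rule_check, "human": lambda out: human_labels[out]}

results = evaluate(no_account_leak, implementations, "Your balance is available in the app.")
```

Because every `Result` points at the same versioned `Objective`, disagreement between implementations becomes a query rather than a reconciliation project.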
Archetype 3: open-source tracing
Primary abstraction: raw span, no opinion about scoring. The library captures spans (OpenTelemetry-flavored or otherwise) and exposes them for storage, query, and visualization. The eval layer is the team's responsibility.
What is native:
- Vendor independence and self-hostability.
- Cost control at scale.
- Flexibility for teams that already have an evaluation framework and want to plug in capture.
- Integration with existing OpenTelemetry pipelines.
What is missing by default:
- Anything that requires opinionated objective definition.
- Deployment gates, drift alerts, and audit lineage without glue work.
- Calibration tracking unless the team builds it.
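To show where the glue starts, the sketch below models captured spans with plain dataclasses instead of a real tracing SDK, then checks which spans have no score attached. The attribute keys are modeled on the OpenTelemetry GenAI semantic conventions and should be verified against the current spec; the `Span` type, the span ids, and the score store are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: dict = field(default_factory=dict)


def unscored(spans, scores_by_span):
    """Span ids captured by tracing that no evaluator has scored yet."""
    return [s.span_id for s in spans if s.span_id not in scores_by_span]


# One hypothetical agent run: a root span, a model call, and a tool call.
spans = [
    Span("t1", "s1", None, "invoke_agent support_bot",
         {"gen_ai.operation.name": "invoke_agent"}),
    Span("t1", "s2", "s1", "chat",
         {"gen_ai.operation.name": "chat", "gen_ai.request.model": "model-x"}),
    Span("t1", "s3", "s1", "execute_tool lookup_order",
         {"gen_ai.operation.name": "execute_tool", "gen_ai.tool.name": "lookup_order"}),
]

# Scores live in a store the team owns; the tracing layer has no opinion about them.
scores_by_span = {"s2": {"groundedness": 0.9}}

gaps = unscored(spans, scores_by_span)
print(gaps)  # ['s1', 's3']
```

The join between spans and scores, and the decision about which spans deserve a score at all, is exactly the part a tracing library leaves to the team.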
Archetype 4: workbench-style
Primary abstraction: experiment. An experiment is a controlled run: fixed inputs, candidate prompts, scored outputs, and side-by-side comparison. The surface is optimized for iteration.
What is native:
- Rapid prompt iteration with score-based comparison.
- A/B and pairwise evaluation across candidate prompts or models.
- Curated dataset management for offline runs.
- Reproducible runs with explicit lineage to the inputs and configurations.
What is secondary:
- Continuous production scoring; experiments are point-in-time runs, not standing infrastructure.
- Real-time drift alerts tied to live traffic.
- Multi-turn session capture as the primary unit.
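An experiment in this sense can be sketched in a few lines. The candidates below are plain functions standing in for a prompt plus model, and the scorer is exact match; the dataset, the candidate names, and the helper functions are all hypothetical.

```python
def exact_match(output, expected):
    """Toy scorer: 1.0 on an exact match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected else 0.0


def run_experiment(dataset, candidates, scorer):
    """Score every candidate on the same fixed inputs; returns per-example scores."""
    return {
        name: [scorer(fn(x), expected) for x, expected in dataset]
        for name, fn in candidates.items()
    }


def pairwise(scores_a, scores_b):
    """Count per-example wins and ties between two candidates."""
    wins = {"a": 0, "b": 0, "tie": 0}
    for sa, sb in zip(scores_a, scores_b):
        wins["a" if sa > sb else "b" if sb > sa else "tie"] += 1
    return wins


# Fixed inputs with expected outputs, and two stand-in candidates.
dataset = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
answers_v1 = {"2+2": "4", "capital of France": "paris"}
answers_v2 = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
candidates = {
    "prompt_v1": lambda x: answers_v1.get(x, ""),
    "prompt_v2": lambda x: answers_v2.get(x, ""),
}

table = run_experiment(dataset, candidates, exact_match)
comparison = pairwise(table["prompt_v1"], table["prompt_v2"])
print(comparison)  # {'a': 0, 'b': 2, 'tie': 1}
```

The run is reproducible because inputs, candidates, and scorer are all fixed up front. Nothing in it observes live traffic, which is why continuous scoring sits outside this archetype's core.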
What all four share
Surface parity is common. All four can show a trace, run an LLM judge, render a chart, and fire an alert. The differences appear when one of these questions arises:
- "Which version of which rubric, judged by which model, produced this score?"
- "Did the score regression on dimension X cause this deployment to block?"
- "Is judge agreement with human labels drifting on the slice that pays the bills?"
- "If we swap frameworks next quarter, does the evaluation framework survive?"
Each question is answered natively by the archetype whose primary abstraction it touches; the others answer it with glue.
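The third question, judge agreement with human labels on a given slice, reduces to a small computation once judge scores and human verdicts are stored side by side. The sketch below uses raw agreement rather than a chance-corrected statistic such as Cohen's kappa; the record layout, the slice names, and the 0.5 pass threshold are assumptions.

```python
from collections import defaultdict


def agreement_by_slice(records, threshold=0.5):
    """Fraction of examples per slice where the judge's pass/fail matches the human's.

    Each record is (slice_name, judge_score, human_pass).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_name, judge_score, human_pass in records:
        totals[slice_name] += 1
        if (judge_score >= threshold) == human_pass:
            hits[slice_name] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical labeled sample: judge score plus a human pass/fail verdict.
records = [
    ("billing", 0.9, True),
    ("billing", 0.8, False),  # judge passes, human fails: disagreement
    ("billing", 0.2, False),
    ("billing", 0.7, True),
    ("smalltalk", 0.9, True),
    ("smalltalk", 0.1, False),
]

agreement = agreement_by_slice(records)
print(agreement)  # {'billing': 0.75, 'smalltalk': 1.0}
```

Tracking this number per slice over time, instead of once at setup, is what turns calibration into an ongoing metric.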
Example
A team operating a regulated multi-step agent in production:
- The compliance team needs "which versioned rubric judged this output." That is native to an eval-first platform and a glue project everywhere else.
- The platform team needs span-level debugging when an on-call engineer is paged. That is native to open-source tracing or any APM-style stack.
- The product team needs rapid prompt iteration with side-by-side scores when refining the agent's planner. That is native to a workbench-style tool.
- The agent team uses one framework and benefits from native session capture.
The deployed system composes: an eval-first platform owns the deployment gate and audit lineage; an open-source tracing layer handles capture and feeds the platform; a workbench tool handles prompt iteration before changes are promoted; the framework-coupled instrumentation is used only for the framework's native surface. The composition is more wiring than a single tool, but each piece is good at what it owns. A single tool that claimed to cover all four would have left the awkwardness inside the team's glue code regardless.
Comparison
| Archetype | Primary abstraction | Owns deployment gate | Owns span debugging | Owns experiment iteration | Audit lineage | Portability |
|---|---|---|---|---|---|---|
| Eval-first | Versioned objective | Native | Secondary | Secondary | Native | Strong |
| Framework-coupled | Framework session | Within framework | Within framework | Within framework | Limited | Weak |
| Open-source tracing | Raw span | Build it | Native | Build it | Build it | Strong |
| Workbench-style | Experiment | Limited | Limited | Native | Native within experiment | Moderate |
Who should not adopt an eval-first platform
- Teams whose primary operational question is "what happened in this run," not "did this meet the bar."
- Teams without any written objective or rubric to version.
- Teams whose surface lives entirely inside one framework and is not expected to grow beyond it.
Where each archetype is stronger
- Eval-first: Versioned rubric and judge lineage, CI gating, audit-grade evidence, per-objective drift.
- Framework-coupled: Zero-glue native session handling for teams committed to one SDK.
- Open-source tracing: Vendor independence, self-hostability, cost control.
- Workbench-style: Prompt iteration, A/B and pairwise scoring, reproducible offline runs.
Limitations
- No archetype is a complete production stack on its own. Most mature teams compose two or three; pretending one covers all four leads to silent gaps.
- Feature parity claims paper over architectural differences. A trace viewer in an eval-first platform and a trace viewer in an observability-first platform look the same; the lineage they produce does not.
- Framework-coupled tooling looks attractive early and becomes binding later. Migration cost grows with adoption depth.
- Open-source tracing is "free" only if the eval layer is someone else's job. Otherwise it is a build-it-yourself project that competes with shipping the agent.
- Workbench-style tools can hide the gap between experiment and production. A prompt that wins offline can lose online; production scoring against the same scorecard is what closes the loop.
- Annotation throughput is a separately budgeted constraint. All four archetypes assume some human labeling somewhere; without budgeted reviewer time, calibration stalls.
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the standard span and attribute shape underlying open-source tracing.
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," Zheng et al., 2023, https://arxiv.org/abs/2306.05685, for the foundational case on judge calibration that anchors managed-evaluator workflows.
- "Holistic Evaluation of Language Models," Liang et al., 2022, https://arxiv.org/abs/2211.09110, for the dimensional decomposition pattern underlying per-dimension scorecards.
FAQ
Can one tool cover all four archetypes well? Not really. A tool can offer features from every archetype, but the architecture at the center determines which question it answers natively and which it answers with glue.
Where do experiment-tracking tools fit? They are the workbench-style archetype's natural ancestor. They generalize cleanly to LLM experiments when their scoring layer is extended with managed evaluators, and they become awkward when continuous production scoring is bolted on after the fact.
Is open-source tracing enough on its own? It is enough as a capture substrate. It is not enough as a deployment gate, an audit-lineage source, or a drift alerter. Teams that pick open-source tracing usually end up either building the eval layer or composing with an eval-first platform.
How do we know when a framework-coupled tool has stopped serving us? When the team needs to evaluate the same objective across two implementations and the framework's evaluator cannot represent the second one. That is the architectural signal to add a second component.
Does this analysis change for small teams? The right composition shrinks; the framing does not. A small team can run the discipline (versioned rubrics, managed judges, calibration against ground truth) by hand for a while and adopt platforms as the surface grows.
What about cost? Cost is a real dimension, and it usually scales with span volume rather than with score count. Open-source tracing wins on cost ceiling; eval-first wins on cost per insight; framework-coupled and workbench tools sit in between. Cost belongs in the comparison, not as a footnote.