Updated: 2026-02-12 By: Ari Heljakka
Short answer
Observability-first platforms capture traces, spans, and tokens as the primary artifact, then bolt evaluation on as a downstream lens. Evaluation-first platforms invert that: a versioned set of objectives and judges is the system of record, and traces are an input that is continuously scored against those objectives. Most teams need pieces of both, but the architectural choice determines how easy or awkward four things become: drift detection, CI gating, audit trails, and the cost of running judges at scale.
Key facts
- Definition: Observability-first platforms are built around the trace and metric pipeline; evaluation is a feature added on top. Evaluation-first platforms are built around versioned objectives and managed evaluators; traces are an input the evaluators consume.
- When to use: Reach for observability-first when the dominant question is "what happened in this run?" Reach for evaluation-first when the dominant question is "did this run meet our success criteria, and would the next change still meet them?"
- Limitations: Observability-first stacks tend to treat scores as another column on a span and lack first-class rubric versioning. Evaluation-first stacks need explicit ingestion of traces and can feel heavier for teams that just want a dashboard of token usage.
- Example: A team shipping an agent in regulated finance pairs a trace store for span-level debugging with an evaluation system that holds versioned rubrics, ground truth datasets, and CI gates blocking deployments on score regressions.
Key takeaways
- The two categories differ in what they treat as the source of truth: traces versus objectives.
- Observability-first platforms excel at debugging individual runs and visualizing token, latency, and cost distributions.
- Evaluation-first platforms excel at scoring runs against versioned success criteria and blocking regressions in CI.
- Both can run LLM-as-a-judge, but only evaluation-first platforms treat the judge itself as a versioned, managed component.
- Drift detection, audit trails, and deployment gates land in different places depending on which architecture you adopt.
Definition
An observability-first platform is organized around the OpenTelemetry-style trace: the unit of capture is a span with attributes (model, prompt, tokens, latency, cost), and the primary user actions are filtering, grouping, and inspecting those spans. Evaluation is a downstream feature: scores are attributes attached to spans, sometimes produced by built-in evaluators, sometimes ingested from an external scorer. The platform's center of gravity is "what happened."
An evaluation-first platform is organized around the objective and the evaluator. The unit of capture is the scored sample: an input, an output, and one or more 0 to 1 scores produced by versioned, managed evaluators against versioned rubrics. Traces are an input the evaluators run against, not the primary artifact. The platform's center of gravity is "did this meet the bar, and where did it drift?"
The distinction is not academic. It dictates how the platform models data, how it gates deployments, and what is auditable after the fact. A trace store with evaluation tacked on can compute a score, but it usually cannot answer "which version of which rubric, run by which judge model, produced this 0.74?" An evaluation-first store models that lineage as a first-class concern.
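To make the lineage question concrete, here is a minimal sketch of a scored sample that carries its own provenance. The class name, field names, and example values (`refund_policy`, `v3`, `judge-model-a`) are illustrative assumptions, not the schema of any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredSample:
    """One score plus the lineage needed to explain it later (hypothetical schema)."""
    input_text: str
    output_text: str
    score: float            # normalized to the 0 to 1 range
    objective_id: str       # which success criterion was scored
    rubric_version: str     # which version of the written rubric
    evaluator_version: str  # which version of the evaluator definition
    judge_model: str        # which model acted as judge

    def __post_init__(self):
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be in [0, 1]")

    def lineage(self) -> str:
        # Answers: which rubric version, run by which judge, produced this score?
        return (f"{self.objective_id}@{self.rubric_version} / "
                f"evaluator {self.evaluator_version} / {self.judge_model}")


sample = ScoredSample(
    input_text="Can I get a refund after 40 days?",
    output_text="Refunds are available within 30 days of purchase.",
    score=0.74,
    objective_id="refund_policy",
    rubric_version="v3",
    evaluator_version="v7",
    judge_model="judge-model-a",
)
print(sample.lineage())
```

A bare score attribute on a span holds only the `0.74`. In this sketch, the remaining fields are what turn that number into something auditable.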
When this matters
The architectural difference becomes decisive when one or more of these conditions holds:
- Compliance and audit. Regulated surfaces need to answer "which versioned objective, judged by which versioned evaluator, produced the decision that gated this output?" That lineage is native to evaluation-first systems and a bolt-on for observability-first ones.
- CI gating. Treating evaluation as a deployment gate (the same way unit tests are a deployment gate) requires a stable, queryable contract for "the current scorecard." Evaluation-first systems expose that natively. Observability-first systems require glue code to aggregate spans into a gate signal.
- Drift as a first-class signal. When score drift on a specific dimension (toxicity, factuality, instruction following) must trigger alerts independent of trace volume, you need per-objective monitoring. Observability-first dashboards typically slice by service or model, not by versioned rubric.
- Multi-implementation portability. When the same objective is enforced by multiple implementations (rules in front, LLM judge behind, human review at the edge), the evaluation system must hold the objective independently of any implementation. Observability-first platforms tie scoring to the trace, which couples objective and implementation.
If none of those conditions holds, an observability-first platform with evaluator plugins may be all you need. If any does, the architectural choice starts to dominate.
How it works
Observability-first
A typical observability-first pipeline:
- Instrumentation. The application emits spans via OpenTelemetry or a vendor SDK. Each span carries prompt, model, tokens, latency, cost, and arbitrary attributes.
- Trace store. Spans land in a columnar store optimized for time-series and high-cardinality queries.
- Dashboards. Token usage, latency percentiles, cost per route, error rates. Filtering and grouping by attribute is the primary interaction.
- Evaluators as plugins. Optional evaluators (built-in or user-defined) run against sampled spans and write scores back as additional attributes.
- Alerts. Triggered on metric anomalies (cost spike, latency P99, error rate) and, optionally, on score thresholds.
The dominant abstraction is the trace. Scores live alongside other span attributes; rubric versioning, if it exists, lives in the evaluator definition rather than as a first-class object.
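The shape of that pipeline can be sketched with plain dictionaries standing in for spans. The attribute name `eval.groundedness`, the model name, and the helper functions are hypothetical and not taken from any vendor SDK or from the OpenTelemetry conventions. The sketch also shows the glue code needed to turn span-level scores into a single gate signal.

```python
from statistics import mean

# Spans as flat attribute bags; the score is just another attribute.
spans = [
    {"trace_id": "t1", "model": "model-a", "tokens": 412, "latency_ms": 930,
     "eval.groundedness": 0.91},
    {"trace_id": "t2", "model": "model-a", "tokens": 388, "latency_ms": 1210,
     "eval.groundedness": 0.62},
    # This span was not sampled for evaluation, so it carries no score.
    {"trace_id": "t3", "model": "model-a", "tokens": 501, "latency_ms": 870},
]


def mean_score(spans, attribute):
    """Average a score attribute over the spans that carry it, or None if none do."""
    values = [s[attribute] for s in spans if attribute in s]
    return mean(values) if values else None


def gate_signal(spans, attribute, threshold):
    """Glue code: collapse span-level scores into one pass/fail signal."""
    avg = mean_score(spans, attribute)
    return avg is not None and avg >= threshold


print(mean_score(spans, "eval.groundedness"))
print(gate_signal(spans, "eval.groundedness", threshold=0.8))
```

Nothing in these spans records which rubric or evaluator version produced each score, which is the gap the rest of this comparison keeps returning to.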
Evaluation-first
A typical evaluation-first pipeline:
- Objectives. A versioned catalogue of success criteria, each with a written rubric and (often) a ground truth dataset.
- Managed evaluators. Each objective has one or more implementations (LLM judge, rule, classical metric). The evaluator is itself a versioned artifact with an explicit model, prompt, and threshold.
- Ingestion. Production traces (or curated samples) are fed into the evaluator pipeline. Inputs and outputs are scored against the relevant objectives.
- Scorecards. Each run produces a scorecard: a vector of 0 to 1 scores across orthogonal dimensions, composable into aggregate metrics.
- Gates and alerts. CI deployments are gated by score on a held-out evaluation set. Drift on any dimension triggers an alert tied to the specific objective and evaluator version.
- Calibration loop. Judge accuracy is continuously validated against human-labeled ground truth. Recalibration is triggered when agreement drops below threshold.
The dominant abstraction is the scored sample against a versioned objective. Traces are an input, not the product.
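A minimal sketch of those pieces, under stated assumptions: the class and function names are illustrative, and a simple string rule stands in for the LLM judge so the example runs without any external service. It shows a versioned objective, a versioned evaluator, a scorecard keyed by both versions, and a calibration check against human labels.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Objective:
    name: str
    rubric_version: str
    rubric: str


@dataclass(frozen=True)
class Evaluator:
    objective: Objective
    version: str
    threshold: float
    score_fn: Callable[[str, str], float]  # (input, output) -> score in [0, 1]


def mentions_policy_window(_input: str, output: str) -> float:
    # Rule-based stand-in for a judge; a real system might call an LLM here.
    return 1.0 if "30 days" in output else 0.0


def scorecard(evaluators, samples):
    """Mean score per (objective, rubric version, evaluator version)."""
    if not samples:
        raise ValueError("need at least one sample")
    card = {}
    for ev in evaluators:
        scores = [ev.score_fn(i, o) for i, o in samples]
        card[(ev.objective.name, ev.objective.rubric_version, ev.version)] = (
            sum(scores) / len(scores)
        )
    return card


def agreement(evaluator, labeled):
    """Fraction of human-labeled samples where the thresholded judge matches the label."""
    hits = sum(
        (evaluator.score_fn(i, o) >= evaluator.threshold) == label
        for i, o, label in labeled
    )
    return hits / len(labeled)


refund = Objective("refund_policy", "v3", "Answer states the 30 day refund window.")
judge = Evaluator(refund, "v7", 0.5, mentions_policy_window)

samples = [("q1", "Refunds within 30 days."), ("q2", "No refunds, sorry.")]
labeled = [
    ("q1", "Refunds within 30 days.", True),
    ("q2", "No refunds, sorry.", False),
    ("q3", "You have 30 days, probably.", False),  # the rule is fooled here
]

print(scorecard([judge], samples))
print(agreement(judge, labeled))
```

Keying the scorecard by objective, rubric version, and evaluator version is the design choice that matters. A recalibration trigger would compare `agreement` against a chosen floor, which is left out of this sketch.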
What both share
Both categories often expose the same surface features: a trace viewer, an evaluator catalogue, dashboards, and alerts. The difference is which abstraction is foundational. A platform built on traces with evaluators bolted on cannot easily promise "this rubric version, judged by this evaluator version, produced this score." A platform built on objectives can.
Example
A team shipping a customer-support agent in a regulated industry:
- Observability-first slice. A trace store captures every agent run with full span detail, including the system prompt, retrieved documents, tool calls, latency, and tokens. When a user complains about a specific reply, an engineer pulls the trace by ID and walks the spans.
- Evaluation-first slice. A separate system holds the versioned rubric for "answer is grounded in retrieved documents," "answer follows the refund policy," and "answer is appropriate in tone." Each rubric has a ground truth dataset of labeled examples and a managed LLM judge with a pinned model and prompt. A nightly job scores a sample of production traces against each rubric. A CI job scores a held-out evaluation set before every deployment. A regression on any dimension blocks the deploy.
- Where each pays off. The trace store answers "what did the agent say to this user." The evaluation store answers "did the agent meet our bar across the dimensions we care about, did the bar move when we changed the prompt, and which evaluator version produced the score that gated the last deployment."
Combining them is the common pattern. The architectural question is which abstraction owns the deployment gate and the audit trail.
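The "regression on any dimension blocks the deploy" rule can be sketched as a small comparison between a baseline scorecard and a candidate scorecard. The dimension names, the scores, and the 0.02 tolerance are assumptions chosen for illustration. A real CI job would pass the returned code to `sys.exit`.

```python
def regressions(baseline, candidate, tolerance=0.02):
    """Dimensions where the candidate is missing or worse than baseline by more than tolerance."""
    return {
        dim: (baseline[dim], candidate.get(dim))
        for dim in baseline
        if candidate.get(dim) is None or candidate[dim] < baseline[dim] - tolerance
    }


def exit_code(baseline, candidate, tolerance=0.02):
    """Return 1 (block the deploy) if any dimension regressed, else 0."""
    failed = regressions(baseline, candidate, tolerance)
    for dim, (before, after) in sorted(failed.items()):
        print(f"REGRESSION {dim}: {before} -> {after}")
    return 1 if failed else 0


# Scores on the held-out evaluation set, before and after a prompt change.
baseline = {"groundedness": 0.90, "refund_policy": 0.95, "tone": 0.88}
candidate = {"groundedness": 0.91, "refund_policy": 0.89, "tone": 0.87}

print(exit_code(baseline, candidate))
```

Here only `refund_policy` falls outside the tolerance, so the gate blocks on that single dimension even though the other two held or improved.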
Comparison
A category-level view, with the wins distributed across both:
| Criterion | Observability-first | Evaluation-first |
|---|---|---|
| Primary unit | Trace, span, metric. | Scored sample against a versioned objective. |
| Source of truth | What happened in production. | Whether output met the success criteria. |
| Rubric versioning | Often implicit, tied to evaluator config. | First-class, versioned artifact. |
| Judge versioning | Usually a single evaluator definition per name. | Each judge is a managed component with model, prompt, and threshold pinned. |
| CI gating | Requires glue to aggregate spans into a gate signal. | Native: gate on scorecard against held-out set. |
| Audit lineage | Trace ID plus span attributes. | Objective version plus evaluator version plus dataset version. |
| Span-level debugging | Strong. Trace viewer is the main surface. | Usually present, but secondary to scorecard view. |
| Per-dimension drift alerts | Possible, often glued together from metrics. | Native: alert per objective and per dimension. |
| Calibration tracking | Limited. Agreement with ground truth is rarely a first-class metric. | Native: judge agreement is itself a tracked metric over time. |
| Cost of running judges | Pay-per-trace plus pay-per-evaluation if sampled at high rates. | Usually decoupled: judge runs against curated or sampled inputs. |
| Instrumentation effort | High: every span needs OpenTelemetry or SDK coverage. | Lower if scoring curated samples; comparable if scoring all production. |
| Model and prompt portability | Tied to instrumented runtime. | Native: same objective scores the next model implementation. |
The pattern across rows: observability-first wins on debugging individual runs and visualizing system metrics; evaluation-first wins on rubric and judge versioning, CI gating, and audit lineage. Neither subsumes the other.
Limitations
Both categories have real soft spots:
- Observability-first scoring drifts silently. If the evaluator definition changes but the score attribute name does not, dashboards keep rendering, but the meaning under the curve has shifted. Without rubric versioning as a first-class object, this is hard to detect.
- Observability-first dashboards reward what is easy to measure. Token usage, latency, and cost are easy. Per-dimension judge agreement against ground truth is not. The dashboard tells you what the system optimized for, not what the product needed.
- Evaluation-first systems need explicit ingestion. If production traces are not flowing into the scoring pipeline, the scorecard is a stale view of last week. Closing the loop between trace capture and evaluation requires deliberate plumbing.
- Evaluation-first scorecards are only as good as the ground truth dataset. A judge calibrated against a stale dataset will score regressions against a stale bar. Dataset versioning, refresh cadence, and adversarial coverage are operational concerns.
- Both can fall into the trap of treating scores as truth. A 0.84 from a judge is a measurement, not a verdict. Judge agreement with human labels, per dimension, is the real quality gate.
- Coupling implementation to objective is easy to do by accident. A score attribute on a trace, produced by one specific judge model, is not the same artifact as a versioned objective. The naming hides the coupling until the model changes.
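One way to surface the silent drift described in the first limitation is to fingerprint the evaluator definition and store that fingerprint next to every score. The sketch below is a minimal illustration using only the standard library. The definition fields and model names are hypothetical.

```python
import hashlib
import json


def fingerprint(definition: dict) -> str:
    """Short, stable hash of an evaluator definition, independent of key order."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def definition_changed(previous: dict, current: dict) -> bool:
    return fingerprint(previous) != fingerprint(current)


old = {
    "judge_model": "judge-model-a",
    "prompt": "Rate groundedness from 0 to 1.",
    "threshold": 0.7,
}
new = {**old, "judge_model": "judge-model-b"}  # same attribute name, different judge

# Storing the fingerprint with the score makes the change visible downstream.
record = {"score_name": "groundedness", "score": 0.84, "evaluator": fingerprint(new)}

print(definition_changed(old, new))
```

If a dashboard groups by the stored fingerprint as well as by score name, a model swap shows up as a break in the series rather than a quiet shift in meaning.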
Where each category is stronger
Observability-first plays well when
- The dominant operational question is "what happened in this run."
- Most failures are caught by metric anomalies (cost, latency, error rate) rather than by score regressions on specific dimensions.
- The team already runs a strong OpenTelemetry pipeline and wants AI runs to land in the same store.
- Evaluation is sampled and lightweight, not the deployment gate.
Evaluation-first plays well when
- Success criteria are written, versioned, and treated as the contract between product and engineering.
- Deployments are gated on a scorecard the same way they are gated on unit tests.
- Compliance, audit, or regulatory pressure forces explicit lineage from decision to versioned rubric.
- The same objective is enforced by multiple implementations (rule, LLM judge, human review) and must remain portable across them.
The two are not exclusive. Many production stacks run both, with traces in one store and scored objectives in another, and a glue layer that ties trace IDs to scorecard entries.
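The glue layer mentioned above can be as small as a join on trace ID. In this sketch, in-memory dictionaries stand in for the two stores, and all field names are assumptions.

```python
from collections import defaultdict

# Trace store: keyed by trace ID, holds run-level detail.
traces = {
    "t1": {"route": "refunds", "latency_ms": 930},
    "t2": {"route": "refunds", "latency_ms": 1210},
}

# Evaluation store: one entry per (trace, objective) with rubric lineage.
scorecard_entries = [
    {"trace_id": "t1", "objective": "groundedness", "rubric_version": "v3", "score": 0.91},
    {"trace_id": "t2", "objective": "groundedness", "rubric_version": "v3", "score": 0.62},
    {"trace_id": "t2", "objective": "tone", "rubric_version": "v1", "score": 0.95},
]


def index_by_trace(entries):
    """Group scorecard entries by the trace they were computed from."""
    by_trace = defaultdict(list)
    for entry in entries:
        by_trace[entry["trace_id"]].append(entry)
    return dict(by_trace)


def traces_below(entries, traces, objective, threshold):
    """Trace records whose score on one objective is below the threshold."""
    return [
        {"trace_id": e["trace_id"], **traces[e["trace_id"]], "score": e["score"]}
        for e in entries
        if e["objective"] == objective
        and e["score"] < threshold
        and e["trace_id"] in traces
    ]


print(traces_below(scorecard_entries, traces, "groundedness", 0.7))
```

The join lets an engineer move from a low score on a specific objective to the trace that explains it, without either store needing to own the other's data model.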
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the span shape that observability-first platforms standardize on.
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," Zheng et al., 2023, https://arxiv.org/abs/2306.05685, for the foundational case that judge quality must itself be measured.
- "Holistic Evaluation of Language Models," Liang et al., 2022, https://arxiv.org/abs/2211.09110, for the dimensional decomposition pattern underlying scorecards across orthogonal dimensions.
FAQ
Is observability-first just a renaming of APM for AI? Closer than the marketing suggests. The data model (trace, span, metric) and the dominant surfaces (filter, group, visualize) are inherited from application performance monitoring. AI-specific additions (token attributes, model attributes, score attributes) are layered on top.
Can an observability-first platform act as an evaluation-first one if I add evaluators? Partially. You can compute scores and attach them to spans. What is harder is treating the rubric and judge as first-class versioned artifacts, gating CI on a scorecard, and producing audit-grade lineage from decision to rubric version. Those require an architecture that puts the objective at the center, not the trace.
Does evaluation-first replace the need for a trace store? Usually not. Span-level debugging is still the right tool for "why did this specific run produce this output." Evaluation-first systems answer a different question: did the run meet the bar, and is the bar holding over time.
How do I decide which one to adopt first? Start from the operational question that is biting hardest. If you are blind to what is happening per run, the trace store comes first. If you are blind to whether outputs meet your success criteria, the evaluation system comes first. Most mature stacks end up with both.
What does "managed evaluator" mean in practice? A managed evaluator is an evaluator whose model, prompt, threshold, and ground truth dataset are pinned, versioned, and queryable. When the evaluator runs, the resulting score carries explicit lineage back to those pinned components. Swapping the underlying model is a deliberate version change, not a silent drift.
Does this debate matter if I am only running one or two prompts? Not much. At small scale, a spreadsheet of test cases and a CI script that reruns them is plenty. The architectural distinction starts to bite when you have multiple objectives, multiple implementations, and a deployment cadence fast enough that drift matters.