Updated: 2026-02-11 By: Ari Heljakka
Short answer
Reviews of agent observability platforms collapse into "I liked this UI" unless the reviewer brings a fixed rubric. This post is the rubric. Five criteria, each scored on a 0 to 1 scale against published behavior (not marketing): data model fidelity, evaluator integration, sampling discipline, cost model, and on-call workflows. Composing the five into a scorecard turns reviews into reproducible artifacts and surfaces the trade-offs each platform actually forces.
Key facts
- Definition: A reviewer-side rubric for grading agent observability platforms across five orthogonal dimensions, each scored 0 to 1, composed into a per-platform scorecard.
- When to use: Before selecting or recommending a platform; before writing a public comparison; during a quarterly re-evaluation of the existing stack.
- Limitations: The rubric scores capability fit; it does not score implementation pain, vendor stability, or compliance posture, which require separate criteria.
- Example: Two platforms tie on UI quality; the rubric separates them on evaluator integration (0.9 vs 0.4) and cost model (0.5 vs 0.8), revealing the trade-off a screenshot tour would have hidden.
Key takeaways
- Reviews without a rubric collapse to taste. A rubric makes them reproducible and forces specific trade-offs into the open.
- The five orthogonal criteria are data model fidelity, evaluator integration, sampling discipline, cost model, and on-call workflows.
- Each criterion is scored 0 to 1 against published behavior or a reproducible test, then composed into a weighted scorecard.
- Publishing the per-criterion breakdown alongside the composite, not a single number alone, preserves the trade-off structure that single-score reviews hide.
- The rubric is independent of vendor. Re-running it on the same platforms in a year produces drift signal on each one.
Definition
Reviewer-side evaluation criteria are the dimensions a reviewer scores against a platform's published behavior or a reproducible test, independent of the platform's marketing. They differ from buyer-side selection criteria in that they are designed to be applied consistently across multiple platforms in a public comparison, not against a single team's operational context.
A criterion is well-formed when it (1) maps to a property the platform either has or does not have, (2) can be scored without privileged access, and (3) is orthogonal to the other criteria. The five below have been used by reviewers and platform engineers across multiple comparisons and tend to hold under pressure.
When this matters
The rubric matters most in two situations:
- Writing or reading a comparison. Platform comparisons without an explicit rubric are not falsifiable. The reader cannot tell whether the reviewer scored what the reader cares about. A published rubric makes the review reproducible.
- Re-evaluating the existing stack. Platforms move. A rubric run on the same stack a year later surfaces capability drift, both gains (a vendor shipped versioned objectives) and regressions (a pricing change moved cost-model into the red).
The rubric does not replace operational judgment; it concentrates it on the dimensions that decide whether the platform holds up.
How it works: the five criteria and the composite scorecard
Each criterion is described below with the question it answers, what 0, 0.5, and 1 look like, and a quick test to apply. The five compose into a per-platform scorecard with weights chosen to match the team's operational profile.
The five criteria
Criterion 1, data model fidelity
Question: What is the platform's primitive, and does it capture what an agent actually does?
- 0: The primitive is a free-text log line. Tool calls, plans, and retries are stored as strings without structure.
- 0.5: The primitive is a span with structured fields, but tool results, plans, and re-plans are stored as opaque blobs.
- 1: The primitive is a trace tree with first-class span types for model calls, tool calls (with structured request and structured response), planner outputs, retries, and side effects.
Test: Open a multi-step agent trace. Count the distinct event types you can filter on. Fewer than four is a 0; four to seven is a 0.5; eight or more is approaching 1.
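The counting test can be sketched in a few lines. This is a minimal illustration, assuming the trace can be exported as a list of spans that each carry a `type` field; that export shape and the function names are hypothetical, not any platform's API. The thresholds are the ones stated in the test, with 1.0 returned as an upper anchor that a reviewer may still adjust down.

```python
def count_filterable_types(spans):
    """Count the distinct event types present in an exported trace."""
    return len({span["type"] for span in spans})


def fidelity_anchor(n_types):
    """Map a type count to the rubric anchors: <4 -> 0, 4-7 -> 0.5, 8+ -> 1."""
    if n_types < 4:
        return 0.0
    if n_types < 8:
        return 0.5
    return 1.0  # "approaching 1": an upper anchor, not an automatic full score


# Hypothetical export of one multi-step agent trace.
trace = [
    {"type": "model_call"},
    {"type": "tool_call"},
    {"type": "tool_result"},
    {"type": "plan"},
    {"type": "retry"},
    {"type": "tool_call"},  # repeated type, counted once
]

n = count_filterable_types(trace)
print(n, fidelity_anchor(n))  # 5 distinct types map to the 0.5 anchor
```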
Criterion 2, evaluator integration
Question: How does the platform attach scored signals to traces?
- 0: No evaluator integration. Traces are searchable; scoring is the team's problem.
- 0.5: Bring-your-own evaluator. The platform stores a numeric score next to a trace; versioning and calibration are external.
- 1: Managed evaluators with versioned objectives, calibration datasets, judge model pins, normalized 0 to 1 scores, and tracked judge agreement against ground truth.
Test: Ask the platform "what is the current version of the faithfulness evaluator, what dataset is it calibrated against, and what is its agreement rate with human labels this week?" If the answers exist as first-class artifacts, the score is 1.
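The three answers the test asks for can be pictured as a small record plus one computed number. The sketch below is illustrative only: `EvaluatorRecord`, its field names, and the example values are hypothetical, and agreement is taken here as the plain fraction of items where the judge's pass/fail label matches the human label, which is one of several reasonable definitions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EvaluatorRecord:
    """Hypothetical shape of what a reviewer asks the platform to show."""
    name: str
    version: Optional[str]
    calibration_dataset: Optional[str]
    judge_model: Optional[str]


def is_first_class(record: EvaluatorRecord) -> bool:
    """True when version, calibration dataset, and judge pin are all present."""
    return None not in (record.version, record.calibration_dataset, record.judge_model)


def agreement_rate(judge: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of items where the judge label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


# Made-up example values, not taken from any real platform.
record = EvaluatorRecord("faithfulness", "v3", "calibration-set-a", "pinned-judge")
judge_labels = [1, 1, 0, 1, 0]
human_labels = [1, 0, 0, 1, 0]

print(is_first_class(record))                      # True
print(agreement_rate(judge_labels, human_labels))  # 0.8 (4 of 5 match)
```

A platform that can only return the score, with the version and calibration dataset living in the team's own notes, would fail `is_first_class` and sit nearer the 0.5 anchor.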
Criterion 3, sampling discipline
Question: How does the platform let the team control the cost-versus-coverage trade-off?
- 0: No sampling controls. Either everything is captured (and billed) or nothing is.
- 0.5: Head-based sampling at a global rate. No per-surface, per-route, or per-dimension control.
- 1: Head, tail, and boundary sampling, configurable per surface or per route. Mandatory-capture rules for compliance-sensitive paths.
Test: Configure 100 percent capture for a regulatory path and 5 percent head-based sampling for a high-volume non-critical path. If the platform supports both in the same workspace, the score is approaching 1.
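To make the test concrete, here is a rough sketch of what per-route rules with a mandatory-capture path might look like if written by hand. The route names, the rate table, and the hashing scheme are all assumptions for illustration; real platforms expose this through their own configuration, and tail or boundary sampling is not covered here.

```python
import hashlib

# Hypothetical per-route capture rates; route names are illustrative.
CAPTURE_RATES = {
    "/claims/regulated": 1.00,  # mandatory capture for a regulatory path
    "/chat/smalltalk": 0.05,    # 5 percent head-based sampling
}
DEFAULT_RATE = 0.05


def trace_fraction(trace_id: str) -> float:
    """Map a trace id to a stable pseudo-random value between 0 and 1."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_capture(route: str, trace_id: str) -> bool:
    """Head-based decision: made once at trace start from route and trace id."""
    rate = CAPTURE_RATES.get(route, DEFAULT_RATE)
    if rate >= 1.0:
        return True  # mandatory-capture rule, never subject to sampling
    return trace_fraction(trace_id) < rate


ids = [f"trace-{i}" for i in range(10_000)]
regulated = sum(should_capture("/claims/regulated", t) for t in ids)
smalltalk = sum(should_capture("/chat/smalltalk", t) for t in ids)
# Every regulated trace is kept; the smalltalk count should be roughly
# 5 percent of 10,000, with the exact figure depending on the hash.
print(regulated, smalltalk)
```

Hashing the trace id rather than calling a random number generator keeps the decision reproducible, so the same trace is kept or dropped consistently across services.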
Criterion 4, cost model
Question: Can the team predict the bill at production volume?
- 0: Cost is per-event with no visible scaling, no retention tiers, no separation of evaluator-call cost from ingest cost.
- 0.5: Clear per-event pricing and retention tiers, but evaluator and query costs are bundled or opaque.
- 1: Per-event pricing, retention tiers, evaluator-call pricing, query-cost transparency, and egress disclosed. A reproducible calculator that can be run against the team's actual volume.
Test: Build a cost model at 10x the pilot volume. If the model is reproducible from public pricing without a sales conversation, the platform is approaching 1.
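A cost model of this kind can be as small as the sketch below. Every price and volume in it is made up for illustration, and the model assumes purely linear pricing; a real price list with volume tiers, query charges, or egress fees would need extra line items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical public price list; every number used below is made up."""
    usd_per_event: float
    usd_per_gb_month: float
    usd_per_evaluator_call: float


@dataclass(frozen=True)
class Volume:
    events: int
    retained_gb: float
    evaluator_calls: int

    def scaled(self, factor: float) -> "Volume":
        """Return the same workload multiplied by a constant factor."""
        return Volume(
            int(self.events * factor),
            self.retained_gb * factor,
            int(self.evaluator_calls * factor),
        )


def monthly_cost(p: Pricing, v: Volume) -> dict:
    """Break the monthly bill into separately visible line items."""
    lines = {
        "ingest": p.usd_per_event * v.events,
        "retention": p.usd_per_gb_month * v.retained_gb,
        "evaluators": p.usd_per_evaluator_call * v.evaluator_calls,
    }
    lines["total"] = sum(lines.values())
    return {name: round(usd, 2) for name, usd in lines.items()}


pricing = Pricing(usd_per_event=0.0001, usd_per_gb_month=0.10, usd_per_evaluator_call=0.002)
pilot = Volume(events=1_000_000, retained_gb=50.0, evaluator_calls=100_000)

print(monthly_cost(pricing, pilot))             # pilot volume
print(monthly_cost(pricing, pilot.scaled(10)))  # 10x the pilot volume
```

If any of the three price fields cannot be filled in from public documentation, that gap is itself the finding the test is looking for.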
Criterion 5, on-call workflows
Question: How quickly can an on-call engineer get from page to root cause?
- 0: Trace explorer with filters; no clustering, no diffs, no replays.
- 0.5: Trace explorer plus saved views and annotations. Manual cluster construction.
- 1: Failure clustering by semantic pattern, version-diff views (prompt or model), replay against a candidate prompt or model, deep-link from alert to span.
Test: Simulate an incident on a real or replayed failure. Measure three intervals: page to root-cause trace, page to a cluster of similar failures, and page to a diff between two versions. Below 60 seconds across the three is approaching 1.
Composing the scorecard
Each criterion produces a 0 to 1 score; the five compose into a per-platform scorecard. Weights are not universal; pick them to match the team's operational profile.
A reasonable default weighting:
| Criterion | Default weight | Reasoning |
|---|---|---|
| Data model fidelity | 0.25 | Sets the ceiling on what every other criterion can do. |
| Evaluator integration | 0.25 | Decides whether the stack is a viewer or a quality system. |
| Sampling discipline | 0.15 | Decides whether cost or the team controls coverage. |
| Cost model | 0.15 | Predicts whether the platform survives 12 months of real volume. |
| On-call workflows | 0.20 | Decides incident time-to-resolution; compounds over the on-call year. |
A weighted composite produces a single number for ranking, but the per-criterion breakdown is the artifact reviewers should publish; it preserves the trade-off structure that a single score collapses.
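The composition step is a weighted sum. The sketch below uses the default weights from the table; the criterion keys, the function names, and the example scores are illustrative choices rather than part of the rubric.

```python
# Default weights from the table above; the dictionary keys are illustrative.
DEFAULT_WEIGHTS = {
    "data_model_fidelity": 0.25,
    "evaluator_integration": 0.25,
    "sampling_discipline": 0.15,
    "cost_model": 0.15,
    "on_call_workflows": 0.20,
}


def composite(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-criterion scores, each expected in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for criterion, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{criterion} score {score} is outside [0, 1]")
    return sum(weights[c] * scores[c] for c in weights)


def scorecard(platform, scores, weights=DEFAULT_WEIGHTS):
    """Bundle scores, weights, and composite so readers can re-weight."""
    return {
        "platform": platform,
        "scores": dict(scores),
        "weights": dict(weights),
        "composite": round(composite(scores, weights), 2),
    }


# Made-up scores for a hypothetical platform.
example = {
    "data_model_fidelity": 0.5,
    "evaluator_integration": 0.5,
    "sampling_discipline": 1.0,
    "cost_model": 1.0,
    "on_call_workflows": 0.5,
}
print(scorecard("example-platform", example))
```

Returning the scores and weights next to the composite keeps the published artifact re-weightable instead of reducing it to one number.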
Example: applying the rubric to two platforms
A reviewer compares two platforms used at the same trace volume on the same agent.
Platform P
- Data model fidelity: 1.0 (full trace tree, first-class tool spans).
- Evaluator integration: 0.5 (bring-your-own; scores attached, not versioned).
- Sampling: 0.5 (head-based only).
- Cost model: 0.8 (clear per-event, retention tiers, evaluator-call cost separable).
- On-call: 0.5 (explorer plus saved views, no clustering).
Weighted composite (default weights): 0.67.
Platform Q
- Data model fidelity: 0.7 (spans yes, retries opaque).
- Evaluator integration: 1.0 (managed, versioned, calibrated).
- Sampling: 1.0 (head, tail, boundary, per-surface).
- Cost model: 0.5 (opaque evaluator-call pricing).
- On-call: 0.9 (clustering, version diff, replay).
Weighted composite (default weights): 0.83.
Platform Q wins on the composite, but the per-criterion breakdown surfaces the trade-off: Platform P has stronger raw trace fidelity, Platform Q has a stronger quality system around weaker traces. A team whose biggest pain is incident response will favor Q; a team that needs deep retry-level instrumentation may favor P, or use both.
This is what the rubric makes visible. A screenshot tour would have produced a winner without explaining the trade-off.
Comparison: where each criterion dominates
| Team context | Criterion that dominates the choice |
|---|---|
| Quality regressions are the top failure mode | Evaluator integration |
| Trace volume is 1M+ per day | Cost model and sampling discipline |
| Frequent on-call pages | On-call workflows |
| Heavy tool use, complex agents | Data model fidelity |
| Regulated or audit-heavy industry | Sampling discipline (mandatory-capture) |
The same platform can win for one team and lose for another. The rubric does not produce a universal ranking; it produces a comparable scorecard, which is the right artifact for a public review.
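The dependence on team context can be checked by re-weighting the Platform P and Platform Q scores from the example. In the sketch below the per-criterion scores and the default weights come from this post; the `TOOL_HEAVY_WEIGHTS` profile is a hypothetical weighting invented here for a team that cares mostly about data model fidelity.

```python
# Per-criterion scores for Platform P and Platform Q from the worked example.
P = {
    "data_model_fidelity": 1.0,
    "evaluator_integration": 0.5,
    "sampling_discipline": 0.5,
    "cost_model": 0.8,
    "on_call_workflows": 0.5,
}
Q = {
    "data_model_fidelity": 0.7,
    "evaluator_integration": 1.0,
    "sampling_discipline": 1.0,
    "cost_model": 0.5,
    "on_call_workflows": 0.9,
}

DEFAULT_WEIGHTS = {
    "data_model_fidelity": 0.25,
    "evaluator_integration": 0.25,
    "sampling_discipline": 0.15,
    "cost_model": 0.15,
    "on_call_workflows": 0.20,
}

# Hypothetical profile: heavy tool use, so data model fidelity dominates.
TOOL_HEAVY_WEIGHTS = {
    "data_model_fidelity": 0.50,
    "evaluator_integration": 0.10,
    "sampling_discipline": 0.10,
    "cost_model": 0.20,
    "on_call_workflows": 0.10,
}


def composite(scores, weights):
    """Weighted sum of per-criterion scores."""
    return sum(weights[c] * scores[c] for c in weights)


def winner(cards, weights):
    """Name of the platform with the highest composite under these weights."""
    return max(cards, key=lambda name: composite(cards[name], weights))


cards = {"P": P, "Q": Q}
print(winner(cards, DEFAULT_WEIGHTS))     # Q
print(winner(cards, TOOL_HEAVY_WEIGHTS))  # P
```

The scores never change between the two calls; only the weights do, which is why publishing both lets a reader redo the ranking for their own context.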
Limitations
- The rubric scores capability, not implementation pain. Migration, vendor stability, and integration depth are separate.
- Public scoring depends on public behavior. Some platforms hide pricing or evaluator details behind a sales call; those criteria are harder to score reproducibly.
- Weights are not universal. Apply them to the team's operational profile; do not adopt a vendor-supplied weighting.
- Re-running the rubric is overhead. Re-run it quarterly or after a major platform release; do not redo it every week.
- Some products span categories. A platform that is half eval-first and half observability-first will have unusual scores; that is signal, not noise.
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the trace-and-span model behind data-model fidelity scoring.
- OpenTelemetry sampling documentation, https://opentelemetry.io/docs/concepts/sampling/, for the head, tail, and probabilistic sampling primitives behind sampling discipline.
- Anthropic, Test and evaluate, https://docs.anthropic.com/en/docs/test-and-evaluate/, for the rubric-and-judge model behind managed evaluators.
FAQ
Should I publish my scorecard with weights? Yes, with both the per-criterion scores and the weights, so the reader can re-weight to their own context.
Can the rubric be automated? Partially. Data model fidelity and on-call workflows benefit from human judgment. Sampling and cost can be scored from documentation. Evaluator integration depends on whether the platform exposes versioning as first-class.
Should I add criteria specific to my industry? Yes. Compliance, data residency, and audit trails are often blocking; add them as binary gates upstream of the rubric or as additional dimensions with their own weights.
Why not just use a feature checklist? Feature checklists treat features as binary and do not weight them. The rubric is graded and weighted, which is what allows comparisons to reflect operational reality.
How often should I re-run the rubric? Quarterly, or after any platform change that touches one of the five criteria (pricing, evaluator framework, sampling, data model, on-call tooling). Track scores over time as drift signal.
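Tracking drift can be as simple as diffing two dated scorecards for the same platform. The sketch below is one possible shape: the scores are invented, and the 0.1 reporting threshold is an arbitrary assumption a reviewer would tune.

```python
def drift(previous, current, threshold=0.1):
    """Return criteria whose score moved by at least `threshold` between runs."""
    changes = {}
    for criterion, old_score in previous.items():
        delta = round(current[criterion] - old_score, 2)
        if abs(delta) >= threshold:
            changes[criterion] = delta
    return changes


# Invented scores for one platform, one year apart.
last_year = {
    "data_model_fidelity": 0.7,
    "evaluator_integration": 0.5,
    "sampling_discipline": 1.0,
    "cost_model": 0.8,
    "on_call_workflows": 0.9,
}
this_year = {
    "data_model_fidelity": 0.7,
    "evaluator_integration": 1.0,  # e.g. the vendor shipped versioned objectives
    "sampling_discipline": 1.0,
    "cost_model": 0.5,             # e.g. a pricing change made costs opaque
    "on_call_workflows": 0.9,
}

print(drift(last_year, this_year))
# {'evaluator_integration': 0.5, 'cost_model': -0.3}
```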