Agent Observability Buyer's Guide: Evaluation Criteria

Updated: 2026-01-09 By: Ari Heljakka

Short answer

A buyer evaluating agent observability platforms is buying a multi-year contract for a system that will shape how the team ships, audits, and remediates LLM agents. The criteria that matter are not the demo features. They are total cost of ownership across all axes (per trace, per evaluator, per seat, per egress), vendor lock-in (data portability, framework neutrality, OpenTelemetry support), integration paths and the cost of leaving, audit and compliance posture (data residency, retention, lineage queries), and operational fit (SLA, support model, on-call workflows). The vendor name is the last variable; the category-level criteria pick the shape of the answer before procurement begins.

Key facts

  • Definition: A buyer's guide for agent observability platforms is a structured evaluation against criteria that survive a vendor's roadmap: cost, lock-in, integration, audit, and operational fit, with weights set by the team's actual use case.
  • When to use: Whenever the platform choice will outlive the current agent build and the cost of switching is material. Selection without these criteria is procurement by demo.
  • Limitations: Pricing pages reveal three or four of the cost axes and hide the rest. Real TCO requires explicit estimates against representative workloads. Vendor SLAs vary widely in what they actually guarantee.
  • Example: A team picks the platform that wins the demo, then discovers that egress charges, evaluator run costs, and per-seat pricing for the access-control plane triple the quoted bill. A buyer's guide forces the full TCO conversation before the contract.

Key takeaways

  • Total cost of ownership is rarely the price on the pricing page. It is the sum of per-trace, per-evaluator, per-seat, and egress charges, plus the engineer time to operate the platform.
  • Lock-in is paid as friction on the day the team wants to leave. Data portability and framework neutrality are insurance, not features.
  • OpenTelemetry support is the most under-asked question. A platform that reads OTLP is one the team can leave; a platform with a proprietary SDK is harder to exit.
  • Audit posture (data residency, retention, lineage queries, deletion semantics) is rarely demoed and almost always negotiable.
  • The SLA matters less than the support model. Time-to-engineer on a serious incident is a better signal than five nines on a glossy page.

Definition

A buyer's evaluation of agent observability platforms is a structured comparison oriented around the contract a buyer is signing: what the team gets in exchange for money, time, and dependency. The criteria are intentionally category-level: TCO, lock-in, integration paths, audit posture, and operational fit. They survive vendor roadmaps because they describe properties of the contract, not properties of this week's release notes.

The output is a defensible recommendation, a written TCO estimate, and a list of contract terms to negotiate before signing.

When this matters

Buyer-level criteria become decisive whenever:

  • The contract is multi-year or has significant cancellation friction.
  • The platform will hold production trace data, evaluation lineage, or PII.
  • More than one team will use it, which makes per-seat and access-control pricing relevant.
  • Audit, compliance, or regulator pressure makes data residency and retention non-negotiable.
  • The team has been burned before by a contract whose real cost looked nothing like the demo.

If the platform is a free or low-cost trial with a small footprint, the criteria still apply but at lower stakes. Once the contract is large, the buyer-level concerns dominate.

How it works

Total cost of ownership

A real TCO model has five cost axes:

  • Per-trace ingestion. Volume, sampling rate, retention period, replay cost.
  • Per-evaluator execution. Especially LLM-judge evaluators, which are typically billed as the judge's tokens plus a platform markup.
  • Per-seat or per-user. Especially for access control, role-based features, and audit plane.
  • Storage and egress. Retention beyond the default, exports, and (often hidden) egress charges when moving data out.
  • Operational cost. Engineer time to instrument, maintain, debug, and respond to platform incidents.

A TCO estimate sums all five against a representative workload. The pricing page usually exposes the first three. The last two are negotiated or hidden, and they dominate the cost at scale.

A defensible buyer's spreadsheet:

  • Estimated traces per month, with growth.
  • Estimated evaluator runs per month (sampling fraction times traces times evaluators per trace).
  • Engineer headcount that needs access.
  • Retention in months.
  • Expected export volume.
  • A reasonable estimate of engineer time per quarter (instrumentation, evaluator maintenance, incident response).

The total is the number to compare against alternatives, including "build it on top of OpenTelemetry plus a managed judge runtime."
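The five axes and the spreadsheet inputs above can be combined into a small model. The sketch below is a minimal illustration under stated assumptions, not any vendor's pricing formula: the class names, field names, and every unit price are hypothetical placeholders to be replaced with quoted figures, and the workload is assumed to be flat over the contract term.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Workload:
    traces_per_month: float
    sampling_fraction: float        # share of traces sent to evaluators
    evaluators_per_trace: int       # evaluators run on each sampled trace
    seats: int                      # engineers who need access
    retention_months: int
    export_gb_per_month: float
    engineer_hours_per_quarter: float


@dataclass(frozen=True)
class UnitPrices:
    # Hypothetical placeholders; fill these in from vendor quotes.
    per_trace: float
    per_evaluator_run: float
    per_seat_month: float
    per_trace_month_stored: float
    per_gb_egress: float
    engineer_hourly_cost: float


def evaluator_runs_per_month(w: Workload) -> float:
    """Sampling fraction times traces times evaluators per trace."""
    return w.sampling_fraction * w.traces_per_month * w.evaluators_per_trace


def monthly_cost_by_axis(w: Workload, p: UnitPrices) -> Dict[str, float]:
    """Steady-state monthly cost for each of the five axes (no growth applied)."""
    stored_traces = w.traces_per_month * w.retention_months
    return {
        "ingestion": w.traces_per_month * p.per_trace,
        "evaluators": evaluator_runs_per_month(w) * p.per_evaluator_run,
        "seats": w.seats * p.per_seat_month,
        "storage_and_egress": stored_traces * p.per_trace_month_stored
        + w.export_gb_per_month * p.per_gb_egress,
        "operations": w.engineer_hours_per_quarter / 3 * p.engineer_hourly_cost,
    }


def tco(w: Workload, p: UnitPrices, months: int) -> float:
    """Total over the contract term, assuming a flat workload."""
    return months * sum(monthly_cost_by_axis(w, p).values())
```

Keeping the per-axis breakdown separate from the total makes it easy to see which axis dominates for a given vendor quote. A growth assumption can be layered on by evaluating the model month by month instead of multiplying a flat figure.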

Vendor lock-in

Lock-in is paid as exit friction. Useful questions:

  • Data portability. Can production traces be exported in a standard format (OTLP) at any time? Is export rate-limited? Is export charged per gigabyte?
  • Schema portability. Are span attributes and evaluator scores stored in a format the team can re-host, or are they wrapped in vendor-specific encodings?
  • Framework neutrality. Does the platform read OTLP from any framework, or does it require the vendor's SDK? The SDK is convenient; it is also coupling.
  • Evaluator portability. Are evaluators expressed in versionable code the team controls, or in a vendor UI? UI-authored evaluators do not migrate.
  • Audit lineage portability. Can rubric versions, evaluator versions, and dataset versions be exported with the scores? Without that, lineage queries die at the vendor boundary.
  • Identity and access. Does the platform integrate with the team's identity provider? Is role assignment exportable?

Lock-in is not always wrong. It can be the price for genuinely better integration. The point of the criterion is to know what is being traded.
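One way to keep the SDK trade-off visible in code is to route instrumentation through a thin interface the team owns, so that only one adapter knows which backend is in use. The sketch below is illustrative: `TraceSink`, `InMemorySink`, and `record_agent_step` are hypothetical names, not part of any vendor SDK or of OpenTelemetry.

```python
from typing import Any, Dict, List, Protocol, Tuple


class TraceSink(Protocol):
    """Team-owned interface; application code depends only on this."""

    def emit(self, name: str, attributes: Dict[str, Any]) -> None: ...


class InMemorySink:
    """Stand-in backend for tests. A production adapter would forward the
    same call to an OTLP exporter or to a vendor SDK."""

    def __init__(self) -> None:
        self.spans: List[Tuple[str, Dict[str, Any]]] = []

    def emit(self, name: str, attributes: Dict[str, Any]) -> None:
        # Copy the attributes so later mutation by the caller is not recorded.
        self.spans.append((name, dict(attributes)))


def record_agent_step(sink: TraceSink, tool: str, latency_ms: float) -> None:
    """Application code: it knows the sink interface, not the vendor."""
    sink.emit("agent.step", {"tool": tool, "latency_ms": latency_ms})
```

With this shape, replacing the platform means writing one new adapter rather than touching every call site. The cost is that vendor-specific conveniences have to be exposed through the interface deliberately.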

Integration paths

Two paths matter:

  • Day-one integration. The cost in engineer time and code changes to get the first trace into the platform and the first evaluator running against it. Lower is better.
  • Day-N integration. The cost in engineer time and code changes to add the second framework, the second evaluator, the second team, and (eventually) to replace the platform. Lower is better.

A platform whose day-one is fast and day-N is expensive often signals high coupling. A platform whose day-one is slower because it asks for OTLP and a versioned rubric file is usually cheaper at day-N.
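The trade-off between the two paths can be made explicit with a cumulative cost comparison. This is a rough sketch with hypothetical inputs measured in engineer-days; it assumes every later integration (a second framework, a second evaluator, a second team) costs the same, which real projects will not match exactly.

```python
from typing import Optional


def cumulative_cost(day_one: float, per_addition: float, additions: int) -> float:
    """Engineer-days spent on the first integration plus `additions` later ones."""
    return day_one + per_addition * additions


def break_even_additions(
    fast_day_one: float,
    fast_per_addition: float,
    slow_day_one: float,
    slow_per_addition: float,
) -> Optional[int]:
    """Smallest number of later integrations at which the slow-start platform
    is no more expensive overall, or None if that never happens."""
    if slow_day_one <= fast_day_one:
        return 0
    if slow_per_addition >= fast_per_addition:
        return None
    extra_up_front = slow_day_one - fast_day_one
    saved_per_addition = fast_per_addition - slow_per_addition
    # Ceiling division: -(-a // b)
    return int(-(-extra_up_front // saved_per_addition))
```

For example, with a fast platform at 2 days up front and 10 per addition, and a slow one at 10 days up front and 4 per addition, `break_even_additions(2, 10, 10, 4)` returns 2: by the second addition the slower start has paid for itself.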

Audit and compliance posture

Several questions that buyers in regulated industries always ask and demos rarely answer:

  • Data residency. Where do traces live geographically? Is residency configurable per workspace or per region?
  • Retention. Default retention, maximum retention, deletion semantics (soft delete vs hard delete vs cryptographic shred).
  • PII handling. Can PII be redacted from traces at the edge before they reach the platform? Are there hooks for the application to mark spans as sensitive?
  • Lineage queries. For any score on any trace, can the platform return the rubric version, evaluator version, dataset version, and judge model in one query?
  • Audit log of platform access. Who looked at what trace, who exported what data, who changed what rubric.
  • Compliance certifications. SOC 2, ISO 27001, HIPAA if applicable. Recency of the report.
  • Subprocessor list. Especially for LLM-judge evaluators, which often route through third-party model providers.
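Redaction at the edge, as raised in the PII question above, can be sketched as a function applied to span attributes before they leave the application. This is a minimal illustration: the two regular expressions are examples rather than a complete PII detector, and the `app.sensitive` attribute is a hypothetical flag the application would set itself.

```python
import re
from typing import Any, Dict

# Illustrative patterns only; real PII detection needs a reviewed, broader set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

SENSITIVE_FLAG = "app.sensitive"  # hypothetical attribute set by the application


def redact_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy that is safer to export.

    String values on spans marked sensitive are replaced entirely;
    on other spans, only matches of the patterns above are masked.
    Non-string values pass through unchanged.
    """
    marked = bool(attributes.get(SENSITIVE_FLAG))
    cleaned: Dict[str, Any] = {}
    for key, value in attributes.items():
        if not isinstance(value, str):
            cleaned[key] = value
        elif marked:
            cleaned[key] = "[REDACTED]"
        else:
            masked = EMAIL.sub("[EMAIL]", value)
            cleaned[key] = US_SSN.sub("[SSN]", masked)
    return cleaned
```

Running the function inside the application's own process, before any exporter is called, is what makes the guarantee independent of the platform's handling.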

The audit posture is the part most likely to require negotiation before signing.
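What answering a lineage question "in one query" means can be shown with an in-memory SQLite table. The schema, column names, and sample values below are hypothetical and chosen only to illustrate the idea; a real platform will store lineage differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE scores (
        trace_id TEXT, evaluator TEXT, score REAL,
        rubric_version TEXT, evaluator_version TEXT,
        dataset_version TEXT, judge_model TEXT
    )
    """
)
# One hypothetical score row that carries its lineage alongside the value.
conn.execute(
    "INSERT INTO scores VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("trace-001", "groundedness", 0.82,
     "rubric-v3", "eval-1.4.0", "ds-2025-11", "judge-model-x"),
)


def lineage(db: sqlite3.Connection, trace_id: str, evaluator: str):
    """Return (rubric, evaluator, dataset, judge) for one score, or None."""
    return db.execute(
        "SELECT rubric_version, evaluator_version, dataset_version, judge_model "
        "FROM scores WHERE trace_id = ? AND evaluator = ?",
        (trace_id, evaluator),
    ).fetchone()
```

The test to apply to a vendor is whether the equivalent lookup works without joining exports by hand. If the versions are not stored with the score, no query can recover them later.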

Operational fit

Beyond price and lock-in:

  • SLA and credits. What is the actual SLA, what triggers credits, and how are credits applied? A 99.9 percent SLA with a 1 percent credit cap is not the same as 99.9 with full refund.
  • Support model. Email-only, chat, dedicated channel, named engineer. Time-to-first-response by severity tier.
  • On-call workflow. Does the platform integrate with the team's pager? Does it expose health and self-test endpoints?
  • Roadmap visibility. Is the platform's roadmap shared, with input from customers, or is every release a surprise?
  • Reference customers. Teams of similar shape and scale running the platform in production. References that match the buyer's profile carry more weight than generic logos.
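The gap between a capped credit and a full refund is easy to quantify. The sketch below is arithmetic only; the function names are hypothetical, the month is assumed to have 30 days, and real contracts define breach windows and credit schedules in their own terms.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Downtime a monthly SLA permits before it is breached."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def credit_owed(monthly_fee: float, credit_rate: float, credit_cap: float) -> float:
    """Credit paid after a breach.

    `credit_rate` is the fraction of the fee the breach would earn and
    `credit_cap` is the contractual ceiling, both as fractions of the
    monthly fee. The smaller of the two applies.
    """
    return monthly_fee * min(credit_rate, credit_cap)
```

A 99.9 percent monthly SLA allows roughly 43.2 minutes of downtime. On a hypothetical fee of 10,000 per month, a breach that would earn a full refund pays 10,000 with no cap, but only 100 under a 1 percent credit cap.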

Contract terms to negotiate

A short list that usually pays off:

  • Price protection. Cap on price increases at renewal.
  • Volume bands. Per-trace and per-evaluator pricing tiers with clear breakpoints, not opaque enterprise pricing.
  • Egress at cost. A cap on data export charges, or no export charges at all.
  • Termination assistance. A written plan for data migration on contract end, with timeline.
  • Data deletion on termination. Explicit deletion timeline with proof.
  • Audit cooperation. Right to audit the platform's controls if the buyer is in a regulated industry.
  • Subprocessor change notice. Advance notice when judge models change provider.

These terms usually require a procurement conversation. The buyer's guide is what justifies asking for them.
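Volume bands with clear breakpoints have the practical advantage that the buyer can recompute an invoice. The sketch below assumes graduated pricing, where each unit is billed at the rate of the band it falls into; the band limits and prices are hypothetical, and some contracts instead apply one rate to the whole volume.

```python
from typing import List, Tuple


def graduated_cost(units: float, bands: List[Tuple[float, float]]) -> float:
    """Cost under graduated tiers.

    `bands` is a list of (upper_limit, unit_price) in ascending order,
    with float("inf") as the last upper limit.
    """
    cost = 0.0
    lower = 0.0
    for upper, price in bands:
        if units <= lower:
            break  # no volume reaches this band
        billable = min(units, upper) - lower
        cost += billable * price
        lower = upper
    return cost


# Hypothetical per-trace bands: first 1M, next 4M, then everything above 5M.
TRACE_BANDS = [(1_000_000, 0.0010), (5_000_000, 0.0006), (float("inf"), 0.0003)]
```

With these illustrative bands, 4 million traces cost 1,000 for the first million plus 1,800 for the next three million. Opaque enterprise pricing offers no equivalent check.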

Example

A team in regulated healthcare runs the buyer's evaluation against three platforms with similar feature lists.

Workload:

  • 4 million traces per month.
  • 10 percent evaluator sampling, 3 evaluators per sampled trace.
  • 25 engineer seats, 6 of them needing audit access.
  • 18-month retention requirement.
  • Regulatory requirement: regional data residency, lineage queries within one business day.
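The workload figures above imply the volumes each vendor should be asked to quote against. The snippet below is arithmetic on those figures only; the stored-trace number additionally assumes flat traffic once the retention window is full, which the workload description does not state.

```python
TRACES_PER_MONTH = 4_000_000
SAMPLING_PERCENT = 10
EVALUATORS_PER_SAMPLED_TRACE = 3
SEATS = 25
AUDIT_SEATS = 6
RETENTION_MONTHS = 18

# Integer arithmetic keeps the counts exact.
sampled_traces = TRACES_PER_MONTH * SAMPLING_PERCENT // 100
evaluator_runs = sampled_traces * EVALUATORS_PER_SAMPLED_TRACE
standard_seats = SEATS - AUDIT_SEATS
# Assumes flat traffic: traces held once the retention window is full.
traces_in_retention = TRACES_PER_MONTH * RETENTION_MONTHS

print(f"sampled traces per month: {sampled_traces:,}")       # 400,000
print(f"evaluator runs per month: {evaluator_runs:,}")       # 1,200,000
print(f"seats without audit access: {standard_seats}")       # 19
print(f"traces held at steady state: {traces_in_retention:,}")  # 72,000,000
```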

TCO estimates:

| Cost axis | Platform A | Platform B | Platform C |
| --- | --- | --- | --- |
| Per-trace ingestion | High | Medium | Medium |
| Per-evaluator runs | Medium | Low (open) | High |
| Per-seat | High | Low | Medium |
| Storage and egress | Medium | Low | High |
| Engineer time per quarter | Low | Medium | Low |
| TCO over 24 months | Highest | Lowest | Middle |

Lock-in:

  • Platform A: proprietary SDK, no OTLP. Export charged per GB.
  • Platform B: OTLP native, evaluators in repo, no egress charges.
  • Platform C: proprietary SDK with OTLP shim. Evaluator UI, no code export.

Audit posture:

  • Platform A: data residency configurable, lineage requires manual SQL.
  • Platform B: data residency configurable, lineage native (one query).
  • Platform C: residency only in US, lineage not supported.

Recommendation: Platform B. Highest day-one integration cost, lowest TCO, lowest lock-in, strongest audit posture. The buyer's guide justifies absorbing the day-one cost in exchange for the multi-year contract shape.

A consumer-product team running the same evaluation with different weights (low audit need, single team, small traffic) reaches a different recommendation, and the artifact documents why.
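How weights change the outcome can be sketched with a simple weighted score. Everything numeric below is hypothetical: the 1 to 5 ratings (higher is better for the buyer) are illustrative stand-ins rather than values taken from the evaluation above, and the two weight profiles are invented to show the mechanism.

```python
from typing import Dict

CRITERIA = ("tco", "lock_in", "day_one", "audit", "ops")

# Hypothetical ratings, 1 (worst for the buyer) to 5 (best).
RATINGS: Dict[str, Dict[str, int]] = {
    "Platform A": {"tco": 1, "lock_in": 1, "day_one": 5, "audit": 3, "ops": 4},
    "Platform B": {"tco": 5, "lock_in": 5, "day_one": 2, "audit": 5, "ops": 3},
    "Platform C": {"tco": 3, "lock_in": 2, "day_one": 4, "audit": 1, "ops": 4},
}

# Hypothetical weight profiles, as integer percentages summing to 100.
REGULATED = {"tco": 30, "lock_in": 20, "day_one": 5, "audit": 35, "ops": 10}
CONSUMER = {"tco": 20, "lock_in": 5, "day_one": 60, "audit": 5, "ops": 10}


def weighted_score(ratings: Dict[str, int], weights: Dict[str, int]) -> int:
    """Sum of rating times weight over all criteria."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(ratings[c] * weights[c] for c in CRITERIA)


def recommend(ratings_by_platform: Dict[str, Dict[str, int]],
              weights: Dict[str, int]) -> str:
    """Name of the platform with the highest weighted score."""
    return max(
        ratings_by_platform,
        key=lambda name: weighted_score(ratings_by_platform[name], weights),
    )
```

Under these illustrative numbers the audit-heavy profile selects Platform B and the velocity-heavy profile selects Platform A. The value of writing the weights down is that the disagreement between the two teams becomes a documented difference in priorities rather than a difference in opinion.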

Comparison

A category-level view of how three archetypes score against buyer criteria:

| Criterion | Hosted, proxy-based, full-feature | OpenTelemetry-native, eval-first | Open-source, self-hosted |
| --- | --- | --- | --- |
| TCO | High at scale (per-trace). | Medium, predictable. | Low licence, high engineering. |
| Lock-in | High (proprietary SDK, store). | Low (OTLP, code-first evals). | Lowest (self-hosted). |
| Day-one integration | Fastest. | Moderate. | Slowest. |
| Day-N integration | Expensive. | Cheap. | Cheapest after ramp. |
| Audit posture | Varies; usually negotiable. | Often strongest. | Strong, owned by team. |
| Operational fit | Strong (vendor SLA). | Strong (vendor SLA). | Owned by team. |
| Best when | Fast time to value matters most. | Long-lived agent, audit weight. | Highest control needs. |

Who should not use a hosted eval-first platform

Teams without a defined operational question or without budget for the day-N audit and gating story usually do not see the value. A small team running one agent inside one framework with no regulator on the horizon will get more from a bundled tool plus a trace store than from a hosted eval-first platform.

Where each category is stronger

  • Hosted, proxy-based, full-feature platforms win on day-one velocity and demo polish. They pay back in the early sprints and cost more over years.
  • OpenTelemetry-native, evaluation-first platforms win on day-N cost, audit, and lock-in posture. They cost more to integrate but survive everything that comes next.
  • Open-source, self-hosted stacks win on control and unit cost. They cost more in engineer time and require explicit ownership of the SLA.

Limitations

  • TCO estimates are estimates. Traffic patterns, evaluator sampling, and engineer time all move. The TCO sheet should be rebuilt at renewal, not signed once.
  • Lock-in is a continuum. Even OTLP-native platforms have some lock-in via UI, integrations, and evaluator catalogues. The question is the cost of an exit, not whether one is possible.
  • Demos hide audit posture. Most demos do not include a lineage query against a real workload. The buyer has to ask explicitly.
  • References age. A reference customer two years in may be running an older version of the platform than the one the buyer would adopt. The reference is a starting point.
  • Procurement timelines are optimistic. Three-month evaluations slip. Building the TCO sheet and stress-test plan early shortens the slip.
  • A buyer's guide does not replace a trial. Numbers on a spreadsheet are weaker evidence than a representative workload running through the platform for two weeks.

FAQ

What is the single most under-asked question by buyers? "Can I export every trace, every evaluator score, and every lineage record in OTLP, at any time, at no marginal cost?" If the answer is anything other than yes, lock-in is being paid in advance.

Is OpenTelemetry support a real differentiator? Yes. It is the property that makes a platform survivable. A platform that consumes OTLP can be left without rewriting the application. A platform that does not can be left only by rewriting it.

How do I estimate TCO when the vendor will not give a price? Build the workload model first (traces, evaluators, seats, retention, egress). Ask each vendor for a quote against the model. A vendor that will not quote against a specific workload is signalling the price will surprise the buyer.

Should the team negotiate the contract? Yes. Price increases at renewal, egress at cost, termination assistance, deletion timelines, and subprocessor notice are all routinely negotiated. A buyer that does not ask gets the default contract, which is usually written for the vendor.

How long should an evaluation take? Long enough to run a representative workload, score it against a small calibration set, and stress-test the audit story. Two weeks is fast; six weeks is normal for a multi-year contract.

What if the right answer is to build on top of OpenTelemetry rather than buy? That is a legitimate outcome of a buyer's guide. The criteria apply to a self-hosted stack as much as to a vendor. The engineering cost line is bigger and the lock-in line is smaller.
