Updated: 2026-01-16 By: Ari Heljakka
Short answer
For a platform engineering team, AI evaluation is an infrastructure build, not a research project. Product teams own the objective ("does my agent handle expense reports correctly"); the platform team owns the underlying machinery that makes objectives measurable at scale: a trace and dataset pipeline, an evaluator execution service, versioned golden datasets as managed resources, and CI gates that block bad deployments. Treat evaluation the way you treat logging, metrics, and feature flags: multi-tenant, self-serve, audited, and consumed through a small set of stable interfaces. Built well, evaluation infrastructure lets dozens of product teams ship AI features against a single, defensible quality bar without the platform team becoming the bottleneck.
Key facts
- Definition: Evaluation infrastructure is the platform layer that lets product teams define objectives, attach evaluators, run them at scale against versioned datasets, and gate deployments on the results, all without standing up bespoke pipelines per team.
- When to use: Whenever more than one product team is shipping LLM-powered features and you want a single quality bar, a single source of truth for ground truth, and a single CI gate pattern.
- Limitations: Cannot replace product-team ownership of the objective. Cannot succeed without an annotation workflow. Cannot ignore cost: LLM-as-judge calls at scale are a real budget line.
- Example: A platform team exposes an evaluator API, a dataset registry, and a reusable CI action. Product teams publish objectives and datasets through a self-serve UI; CI runs the evaluator panel on every prompt change and blocks the merge on regression.
Key takeaways
- Evaluation follows the same pipeline pattern as observability (capture, store, query, alert). Platform teams already know how to build this.
- The three layers worth designing carefully are the data pipeline (what gets sampled and stored), the execution service (multi-tenant evaluator runs), and the CI gate (release decisions).
- Golden datasets are managed resources with versioned snapshots, owner metadata, lineage, and retention policy. Treat them like database schemas, not like text files in a repo.
- Self-serve is the success metric. If product teams need a platform engineer to add a new objective or wire a CI gate, the platform has failed.
- Objectives must be separable from implementations. The same objective should be measurable across model swaps, prompt rewrites, and orchestration changes; that is what makes the platform durable.
Definition
Evaluation infrastructure for an organisation is the set of managed components that lets any team measure the quality of an LLM-powered feature against a versioned objective without writing pipeline code. It typically has three layers:
- Data pipeline. Captures production traces (with sampling), stores them, and surfaces them for annotation and replay.
- Execution service. Runs evaluators (rule-based, LLM-as-judge, reference-based) at scale, on demand, against either a live trace stream or a versioned dataset.
- Release gates. Plugs into CI to fail merges that regress on objectives the product team has declared blocking.
Around those three layers sit shared resources: a dataset registry (versioned ground-truth snapshots), an evaluator catalog (versioned judge prompts and rule definitions, each calibrated against human labels), and a scorecard service (per-feature per-dimension scores over time).
The platform team owns the layers and the registries. Product teams own the objectives, the labels, and the deploy decisions. Done well, the platform never says "yes" or "no" to a release; it just makes the gates trustworthy enough that product teams can.
When this matters
The build becomes worthwhile at three thresholds:
- Second product team. As soon as a second team starts shipping AI features, a single quality bar across teams is cheaper to maintain centrally than to repeat per team.
- First regulated surface. Auditability requirements (immutable versioned datasets, evaluator lineage, reproducible release gates) are much harder to retrofit than to design in. The first regulated feature is the cheapest one to use as the forcing function.
- First production incident traced to silent quality drift. Once a team has had a model swap, prompt rewrite, or dataset rotation cause a real user-visible regression, the appetite for centralised gates appears. Better to build before the incident than after.
Below those thresholds, a lightweight per-team approach is reasonable. Above them, the per-team cost compounds quickly and the platform investment pays back inside a quarter.
How it works
Layer 1: Eval data pipeline
The data pipeline answers "what traces are available, at what resolution, for evaluation."
- Capture. Every LLM call across every product team emits an OpenTelemetry GenAI span (input, output, model, tokens, latency, retrieved context, tool calls). Standardising on the OpenTelemetry schema is what makes evaluation reusable across teams.
- Sampling. Sampling is stratified, not uniform. A typical configuration: 1 to 5% nominal sample, 100% of anomaly-flagged sessions, 100% of human-flagged failures, 100% of a small synthetic adversarial canary slice. The sampling policy lives in the platform, configurable per tenant.
- Storage. Sampled traces are stored in a queryable, columnar store with retention by tier (hot, warm, cold). Hot retention covers the active evaluation window; warm covers regression backstop; cold covers audit replay.
- Annotation queue. A subset of sampled traces is routed to annotation. The annotation tool is owned by the platform; the annotation rubric is owned by the product team.
- Dataset materialisation. Annotated traces snapshot into versioned ground-truth datasets. Each snapshot has an immutable hash, a creation date, the model version it was sampled under, and the set of failure modes it covers.
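The stratified sampling policy described in the list above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `TraceMeta` fields, the `should_sample` name, and the 2% default rate are assumptions chosen for the example (the text only gives a 1 to 5% nominal range), and flags are modelled per trace rather than per session for brevity.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceMeta:
    """Hypothetical per-trace metadata; field names are illustrative."""
    trace_id: str
    anomaly_flagged: bool = False
    human_flagged: bool = False
    canary: bool = False  # synthetic adversarial canary slice


def should_sample(trace: TraceMeta, nominal_rate: float = 0.02) -> bool:
    """Keep every flagged or canary trace, plus a nominal slice of the rest."""
    if trace.anomaly_flagged or trace.human_flagged or trace.canary:
        return True
    # Hash the trace id so the decision is deterministic for a given trace.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < nominal_rate
```

Deriving the decision from a hash of the trace id, rather than from a random number generator, is one way to keep sampling reproducible when a trace is replayed; a per-tenant policy would pass its own `nominal_rate`.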
Versioning is the discipline that makes audits cheap. With the system under test held fixed, a change in scores between two runs has exactly two possible causes: a different evaluator version (including its judge model) or a different dataset version. With immutable snapshots, attribution is trivial; without them, every regression becomes a forensic exercise.
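One way to get an immutable snapshot identifier is to derive it from the content itself. The sketch below assumes annotated examples are plain JSON-serialisable dicts; the `Snapshot` fields mirror the metadata listed under dataset materialisation, while the `materialise` function name and the canonicalisation scheme are illustrative choices, not a prescribed format.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    content_hash: str          # immutable identifier derived from the examples
    created: str               # creation date, e.g. an ISO 8601 string
    model_version: str         # model version the traces were sampled under
    failure_modes: tuple[str, ...]


def materialise(examples: list[dict], created: str, model_version: str,
                failure_modes: list[str]) -> Snapshot:
    # Canonical JSON (sorted keys, sorted rows) so the hash depends only on
    # the content of the examples, not on key order or row order.
    rows = sorted(json.dumps(e, sort_keys=True, separators=(",", ":")) for e in examples)
    digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
    return Snapshot(digest, created, model_version, tuple(sorted(failure_modes)))
```

Under this scheme, editing a single label yields a different hash and therefore a new version, while re-ordering the same examples does not.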
Layer 2: Eval execution service
The execution service answers "given an objective and a dataset, what is the score."
- Evaluator catalog. A versioned registry of evaluators (rule checks, LLM-as-judge prompts, reference-based metrics). Each entry has owner, version, calibration metrics (typically Matthews Correlation Coefficient against a held-out human-labelled slice), and a "gate-eligible" flag based on whether calibration clears the bar.
- Multi-tenant scheduler. Evaluator runs are queued, rate-limited per tenant, and isolated by quota. One team's backfill cannot starve another team's CI run. LLM-as-judge calls hit the same upstream APIs as production, so the scheduler is also a circuit-breaker against runaway spend.
- Dimensional decomposition. Complex objectives are decomposed into orthogonal dimensions (faithful, helpful, safe, format-conforming), each scored 0 to 1. Composition happens at the scorecard layer, not inside individual evaluators. This prevents double-counting when objectives are reused across features.
- Structured results. Every run emits machine-readable results: per-trace, per-evaluator, per-dimension scores plus judge justifications. The output is consumable by CI, by dashboards, and by replay tools.
- Caching. Results for identical (input, model, evaluator-version) tuples are cached; budget-conscious teams can deduplicate aggressively. Cache invalidation is by evaluator version, by judge model version, or by explicit refresh.
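The caching rule in the last bullet can be illustrated with a small in-memory sketch. It assumes that "model" in the cache tuple refers to the judge model (consistent with invalidation by judge model version); `cache_key`, `ResultCache`, and the `run` callback are hypothetical names, and a real service would back this with a shared store rather than a dict.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(trace_input: str, judge_model: str, evaluator_version: str) -> str:
    # JSON-encode the tuple so field boundaries are unambiguous, then hash it.
    payload = json.dumps([trace_input, judge_model, evaluator_version], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """In-memory cache of evaluator scores keyed by (input, model, evaluator-version)."""

    def __init__(self) -> None:
        self._store: Dict[str, float] = {}
        self.hits = 0
        self.misses = 0

    def get_or_run(self, trace_input: str, judge_model: str, evaluator_version: str,
                   run: Callable[[str], float]) -> float:
        key = cache_key(trace_input, judge_model, evaluator_version)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        score = run(trace_input)  # the expensive judge call happens only on a miss
        self._store[key] = score
        return score
```

Because the judge model and evaluator version are part of the key, bumping either one naturally misses the cache, which gives version-based invalidation without an explicit purge step.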
Execution is the layer that gets stressed first. A platform that runs cleanly at 100 evaluator runs per day will buckle at 10,000 without rate limits, caching, and a real scheduler. Design for the buckled state from the start.
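As a rough illustration of per-tenant isolation, the sketch below drains queued jobs in fixed scheduling windows, interleaving tenants and capping each at its quota. The `TenantScheduler` class, the window model, and the quota numbers are assumptions made for the example; a production scheduler would also handle priorities, retries, and spend limits.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class TenantScheduler:
    """Round-robin across tenants with a per-tenant cap per scheduling window."""

    def __init__(self, quota_per_window: Dict[str, int]) -> None:
        self.quota = quota_per_window
        self.queues: Dict[str, Deque[str]] = defaultdict(deque)

    def submit(self, tenant: str, job: str) -> None:
        self.queues[tenant].append(job)

    def next_window(self) -> List[Tuple[str, str]]:
        """Take at most quota[tenant] jobs per tenant, one tenant at a time."""
        remaining = {t: self.quota.get(t, 0) for t in self.queues}
        batch: List[Tuple[str, str]] = []
        progressed = True
        while progressed:
            progressed = False
            for tenant in sorted(self.queues):
                if remaining[tenant] > 0 and self.queues[tenant]:
                    batch.append((tenant, self.queues[tenant].popleft()))
                    remaining[tenant] -= 1
                    progressed = True
        return batch
```

With this shape, a team that enqueues a large backfill still only consumes its own quota in each window, so another team's CI job is picked up in the same window instead of waiting behind the backfill.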
Layer 3: CI/CD release gates
The gate layer answers "should this change ship."
- Declarative gates. Each AI feature has a gate configuration file (or equivalent) in the repo: which dataset snapshot to use, which evaluator panel to run, which dimensions are blocking versus advisory, and the regression tolerance per dimension.
- Triggered runs. A merge to a path under the team's AI directory triggers the gate. The gate pulls the latest dataset snapshot the product team has approved, runs the panel, and compares to the baseline.
- Structured output. Pass / fail per dimension, score deltas, top failing examples (with judge justifications), and the run's permalink. The CI surface that engineers already use becomes the evaluation surface.
- Promotion. Passing gates can promote candidates to a staged environment; staged candidates accumulate live evaluator scores; promotion to production requires the staged scores to clear a second, stricter gate.
- Override and audit. Gates can be overridden by an explicit human approver on a specific run; every override is logged, attributed, and surfaced in a weekly review. Quiet overrides are a culture failure that compounds.
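The core gate decision described in the list above (per-dimension regression tolerance, blocking versus advisory dimensions, comparison to a baseline) can be sketched as below. `DimensionGate`, `evaluate_gate`, and the report layout are hypothetical names for illustration; the scores are assumed to be per-dimension averages in the 0 to 1 range.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DimensionGate:
    blocking: bool    # blocking dimensions fail the merge; advisory ones only report
    tolerance: float  # maximum allowed drop versus the baseline score


def evaluate_gate(gates: Dict[str, DimensionGate],
                  baseline: Dict[str, float],
                  candidate: Dict[str, float]) -> Tuple[bool, Dict[str, dict]]:
    """Return (passed, per-dimension report) for a candidate change."""
    report: Dict[str, dict] = {}
    for dim, gate in gates.items():
        delta = candidate[dim] - baseline[dim]
        report[dim] = {
            "delta": round(delta, 4),
            "regressed": delta < -gate.tolerance,
            "blocking": gate.blocking,
        }
    # Only a regression on a blocking dimension fails the gate.
    passed = not any(r["regressed"] and r["blocking"] for r in report.values())
    return passed, report
```

A CI step could serialise the report next to the failing examples and exit non-zero when `passed` is false, so advisory regressions stay visible without blocking the merge.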
A gate that product teams trust is one they cannot route around invisibly. A gate that they do not trust is one that blocks legitimate work too often. The platform's job is to make the gate trustworthy; the calibration metric on the underlying evaluators is the lever that controls trust.
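The calibration metric named for the evaluator catalog, Matthews Correlation Coefficient against human labels, is straightforward to compute for binary pass/fail judgements. The sketch below uses the standard MCC formula; the `gate_eligible` helper and its 0.7 threshold are illustrative assumptions, since the text does not state where the eligibility bar sits.

```python
import math
from typing import List


def mcc(human: List[bool], judge: List[bool]) -> float:
    """Matthews Correlation Coefficient between human labels and judge verdicts."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def gate_eligible(human: List[bool], judge: List[bool], threshold: float = 0.7) -> bool:
    """Hypothetical eligibility rule: the judge may gate only above the threshold."""
    return mcc(human, judge) >= threshold
```

MCC ranges from -1 (perfect disagreement) through 0 (no better than chance) to 1 (perfect agreement), and unlike raw accuracy it is not inflated when one label dominates the held-out slice.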
Cross-cutting concerns
- Multi-tenancy. Every layer is tenant-aware: data partitioned by team, quotas enforced per team, and a misconfiguration by one team cannot break another. The platform looks like a small set of stable APIs that each team consumes through SDKs and a self-serve UI.
- Model agnosticism. Objectives and evaluators are independent of the model that satisfies them. Swapping models produces a measurable delta on the scorecard; it does not require platform changes. This is the durability property that lets the platform outlive any single model choice.
- Cost attribution. LLM-as-judge spend is rolled up per team, per evaluator, per dataset, and surfaced on a shared dashboard. Cost visibility is what keeps the eval budget from quietly out-growing the production-inference budget.
- Drift monitoring. Evaluator calibration drift, judge model drift, and dataset coverage drift are first-class metrics with alerts. Drift in any of those is a signal that the platform's quality guarantees have weakened and that re-calibration is owed.
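One possible response to evaluator calibration drift, sketched under assumptions: compare the current calibration score with the score recorded at the last calibration, and downgrade a blocking evaluator to advisory when the drop exceeds a threshold. `EvaluatorStatus`, `apply_drift_policy`, and the 0.05 `max_drop` default are hypothetical; the appropriate threshold depends on how noisy the held-out slice is.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class EvaluatorStatus:
    name: str
    baseline_mcc: float     # calibration score recorded at the last calibration
    mode: str = "blocking"  # "blocking" or "advisory"


def apply_drift_policy(status: EvaluatorStatus, current_mcc: float,
                       max_drop: float = 0.05) -> Tuple[EvaluatorStatus, bool]:
    """Return (possibly downgraded status, whether re-calibration is owed)."""
    drifted = (status.baseline_mcc - current_mcc) > max_drop
    if drifted and status.mode == "blocking":
        # Downgrade rather than disable: scores keep flowing but stop blocking merges.
        return replace(status, mode="advisory"), True
    return status, drifted
```

Returning a new status object instead of mutating the old one keeps each transition easy to log and attribute, in line with the audit requirements on overrides.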
Example
A platform team supporting six product teams, each with at least one LLM-powered feature in production:
- Onboarding. A new product team registers their feature in the platform: declares the objective ("agent should answer expense-report questions correctly"), opts into trace capture, and gets a starter evaluator panel from the catalog. Day-one cost: a CLI command and a 30-minute walkthrough.
- Annotation. The team reviews 200 sampled traces, annotating against four dimensions (faithful, helpful, safe, format-correct). Annotations land in the queue; the platform materialises a first dataset snapshot once 50 traces per dimension are annotated.
- Evaluator generation. Once labels per failure mode reach threshold (typically 10 to 20), the platform generates candidate evaluators (rule-based where possible, LLM-as-judge otherwise), calibrates them against held-out labels, and surfaces the ones that clear the gate-eligibility bar.
- CI integration. The team adds the platform's reusable CI action to their prompt directory's CI. A subsequent prompt change runs the calibrated panel against the latest dataset snapshot; a regression on faithfulness fails the merge with a structured diff and links to the worst-failing examples.
- Promotion. A passing change deploys to staging. The platform routes a slice of staging traffic through the evaluator panel in shadow mode; staging scores stable for 48 hours trigger an automatic promotion proposal that a human approves.
- Drift. Six weeks later, the judge model the panel uses ships a new checkpoint. Calibration drift on two dimensions trips an alert; the gates that depend on those dimensions downgrade from blocking to advisory until the platform re-calibrates. The product team is notified; the release cadence holds because the override path is explicit rather than invisible.
The platform team did not write a single product-specific evaluator. The product team did not write a single line of pipeline code. The release decision was a function of the objective and the calibrated panel, not of an engineer's read of the logs.
Limitations
- Cannot replace ownership. A platform that lets product teams skip writing objectives ends up scoring everything against generic rubrics that satisfy no one. The objective is the product team's responsibility; the platform's job is to make that objective cheap to measure.
- Annotation is the throughput bottleneck. No amount of platform engineering substitutes for human-labelled data. Plan annotation capacity and tooling from day one.
- LLM-as-judge cost at scale is real. Without sampling, caching, and rate-limiting, evaluation spend can match or exceed production-inference spend. Cost controls are platform features, not optional ones.
- Calibration is continuous. Judge model updates, policy changes, and production distribution shifts all degrade calibration. A platform that does not re-calibrate is a platform whose gates silently lie.
- Evaluation does not replace integration testing or unit testing. It complements them. A feature can pass every traditional test and fail every evaluator; both gate types are necessary.
- Self-serve has a learning curve. Even with good docs, the first product team to use the platform pays an onboarding cost. Invest in templates, working examples, and a designated platform-side partner for early adopters; that cost amortises.
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the trace schema that lets evaluation be reusable across teams.
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," Zheng et al., 2023, https://arxiv.org/abs/2306.05685, on the calibration assumptions any judge-driven gate depends on.
- Matthews Correlation Coefficient, https://en.wikipedia.org/wiki/Phi_coefficient, the agreement metric most platforms use to qualify evaluators before allowing them to gate.
FAQ
How big should the platform team be to maintain this infrastructure? The initial build is a 2 to 4 engineer-quarter project for an organisation with 5 to 10 product teams. Ongoing maintenance, with good schedulers and good calibration discipline, is typically 0.5 to 1 FTE plus a partial ML or applied-AI partner. The cost scales sub-linearly with product teams; that is the whole point of centralising.
Where do golden datasets live, and who owns them? They live in the dataset registry, which the platform team owns and operates. Each snapshot is owned by the product team that annotated it. Lifecycle (retention, deprecation, retirement) is set by policy. Treat snapshots as immutable: a "change" produces a new version; the old version remains queryable so historical scores remain reproducible.
How do we keep evaluator costs from exploding? Three controls in order of impact: sampling (do not score every trace), caching (deduplicate identical (input, model, evaluator-version) tuples), and rate limiting (per-team quotas on the execution service). A shared cost dashboard with weekly review is the cultural lever that keeps the technical controls honest.
What about products that have no traffic yet? Bootstrap with synthetic adversarial examples and treat the resulting evaluator scores as advisory until production data accumulates. The same platform machinery works; the dataset just starts smaller and matures faster as traffic appears.
Should evaluation be in the same CI as unit tests, or separate? Same CI, separate stage. Engineers want one merge gate, not two. Unit tests run in seconds; evaluation can run in minutes; structure the pipeline so unit tests gate cheaply and evaluation gates slightly later in the pipeline. Both are required to pass for a merge; both surface in the same PR view.
How do we handle evaluator drift across model updates? Calibration metrics (MCC, kappa, agreement) are first-class platform signals. A drift over a threshold downgrades the affected evaluators from blocking to advisory automatically, alerts the platform team, and queues a re-calibration job. The release cadence holds because the override path is explicit; the platform fixes the calibration on its own SLO rather than the product team's release window.
Is OpenTelemetry strictly required? Strictly, no; practically, yes. Without a standard schema, every product team's traces have to be normalised separately, every evaluator has to know about every shape, and the platform becomes glue code. OpenTelemetry GenAI conventions are the cheapest standard to adopt and the easiest to migrate off of if a better one emerges.