Updated: 2026-03-18 By: Ari Heljakka
Short answer
An open-source evaluation library is a set of primitives (metrics, judge wrappers, dataset utilities) that you import, configure, and run inside your own codebase. A managed evaluation platform is a service that hosts those primitives plus the missing parts: versioned rubrics, immutable datasets, calibrated judges with drift tracking, scorecard composition, and CI integration. The choice is not "which scorer is better." It is "how much of the evaluation lifecycle do you want to own versus consume."
Key facts
- Definition: Open-source libraries are code-level evaluation primitives; managed platforms wrap those primitives with governance, versioning, calibration, and operational gates.
- When to use: Libraries when the team has ML infrastructure, strict data-residency constraints, or wants full control over scorer internals. Platforms when calibration, versioning, and CI gating need to be ready out of the box.
- Limitations: Libraries push the operational burden onto your team; platforms introduce a vendor dependency and a per-evaluation cost model.
- Example: A team using a library writes its own dataset registry, judge calibration harness, and CI integration; a team on a platform receives those as managed services and pays per evaluation run.
Key takeaways
- The categories are not direct substitutes; they sit at different points on the build-versus-buy axis.
- Most teams underestimate the operational cost of running evaluation infrastructure (calibration, drift detection, dataset versioning, judge maintenance).
- Both categories should normalize each dimension's score to the range 0 to 1 so dimensions compose into scorecards cleanly.
- Model-agnosticism matters in both: your evaluation criteria should outlive any single model or prompt version.
- The right choice depends less on the scorer implementations and more on who owns the calibration loop and the CI gates.
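To make the normalization takeaway concrete, here is a minimal sketch that rescales raw scores to the range 0 to 1 and combines them into a weighted scorecard. The dimension names, raw ranges, and weights are hypothetical and are not taken from any particular library or platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    raw_min: float  # lowest value the raw scorer can emit
    raw_max: float  # highest value the raw scorer can emit
    weight: float   # relative importance in the scorecard


def normalize(raw: float, dim: Dimension) -> float:
    """Min-max rescale a raw score to [0, 1], clamping out-of-range values."""
    span = dim.raw_max - dim.raw_min
    if span <= 0:
        raise ValueError(f"{dim.name}: raw_max must exceed raw_min")
    return min(1.0, max(0.0, (raw - dim.raw_min) / span))


def scorecard(raw_scores: dict[str, float], dims: list[Dimension]) -> dict[str, float]:
    """Return per-dimension normalized scores plus a weighted 'overall' score."""
    card = {d.name: normalize(raw_scores[d.name], d) for d in dims}
    total_weight = sum(d.weight for d in dims)
    card["overall"] = sum(card[d.name] * d.weight for d in dims) / total_weight
    return card


# Hypothetical dimensions: a 1-5 judge rating, a binary check, a 0-100 score.
DIMS = [
    Dimension("faithfulness", 1, 5, weight=2.0),
    Dimension("format", 0, 1, weight=1.0),
    Dimension("tone", 0, 100, weight=1.0),
]

card = scorecard({"faithfulness": 4, "format": 1, "tone": 50}, DIMS)
```

Because every dimension lands on the same scale, the weighted average is meaningful regardless of how each raw scorer reports its result.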
Definition
An open-source evaluation library is software you install into your application or test pipeline. It typically provides: scorer functions (faithfulness, relevance, toxicity, format checks), wrappers for common LLM judge patterns, dataset utilities (loaders, samplers), and integration with test frameworks. The library runs inside your process or your CI runner. You own the dataset, the rubric, the judge calibration, the versioning, and the gating logic.
A managed evaluation platform is a hosted service that makes versioned objectives the primary artifact. It typically provides: a rubric and dataset registry with immutable versions, a library of pre-built and custom evaluators with calibration tracking, judge agreement metrics against ground-truth labels, scorecard composition with normalized per-dimension scores, drift alerts on judge agreement, and an API for CI integration. The service owns the lifecycle; your team owns the rubrics and the labels.
Both categories can implement the same underlying scorer math. The difference is what is included around the math.
When this matters
Choose according to which of the following constraints is binding for your team.
- "We have ML infrastructure and want to extend scorers freely." Library. The primitives are reusable, customizable, and live in your repo.
- "Calibration agreement, dataset versioning, and CI gating need to be ready next week." Platform. Building these well takes months; consuming them is faster.
- "Our data cannot leave our network." Library, or a platform with a self-hosted deployment option. Treat data residency as a hard constraint.
- "We have one team writing evaluators and twenty teams consuming them." Platform. Centralized versioning and shared rubrics scale better when many teams share the same evaluation surface.
- "We are a research group and want full reproducibility down to the scorer commit hash." Library, pinned in your dependency file. The same applies if a regulatory audit demands access to scorer source.
How it works
A library-centered workflow
- Install the library in your repo. Pin the version.
- Author rubrics as data structures (JSON, YAML, or Python). Store them in git.
- Build or import a dataset. Store inputs and expected outputs in git or a small registry.
- Write a test that loads the dataset, runs the system under test, applies scorers, and emits per-dimension scores.
- Integrate the test into CI. Define thresholds; fail the build below them.
- Calibrate judges by running them against a human-labeled subset; compute agreement metrics manually.
- Refresh datasets periodically, label new examples, re-run calibration, and update thresholds.
Every step is in your code, in your repo, and on your runner. You bear the operational cost; you also own the artifacts completely.
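The test-and-gate steps of this workflow could look roughly like the sketch below. It uses only the Python standard library; `run_system`, the two scorers, the threshold values, and the inline dataset are hypothetical stand-ins, not the API of any real evaluation library.

```python
import json
from statistics import mean

# Hypothetical per-dimension minimums; in practice these would live in git next to the rubric.
THRESHOLDS = {"format": 0.9, "length": 0.8}


def run_system(text: str) -> str:
    """Stand-in for the system under test (assumed to return a JSON summary)."""
    return json.dumps({"summary": text[:40]})


def score_format(output: str) -> float:
    """Deterministic check: 1.0 if the output is a JSON object with a 'summary' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(parsed, dict) and "summary" in parsed else 0.0


def score_length(output: str, max_chars: int = 80) -> float:
    """Deterministic check: 1.0 if the output fits within max_chars."""
    return 1.0 if len(output) <= max_chars else 0.0


SCORERS = {"format": score_format, "length": score_length}


def evaluate(dataset: list[dict]) -> dict[str, float]:
    """Run the system on every example and average each scorer over the outputs."""
    outputs = [run_system(example["input"]) for example in dataset]
    return {name: mean(fn(out) for out in outputs) for name, fn in SCORERS.items()}


def failing_dimensions(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """List the dimensions whose mean score falls below its threshold."""
    return [dim for dim, minimum in thresholds.items() if scores[dim] < minimum]


def test_regression_gate():
    # A pytest-style test: an assertion failure fails the CI job.
    dataset = [
        {"input": "First hypothetical document to summarize."},
        {"input": "Second hypothetical document to summarize."},
    ]
    scores = evaluate(dataset)
    assert failing_dimensions(scores, THRESHOLDS) == [], scores
```

A real harness would load the dataset from the pinned files in the repo and would typically add LLM-judge scorers alongside the deterministic ones.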
A platform-centered workflow
- Define rubrics in the platform. Each rubric is versioned automatically.
- Upload or sync datasets. Each dataset version is immutable and queryable.
- Configure evaluators (deterministic checks and managed judges) against rubrics. Each evaluator has its own version and a calibration set.
- Run evaluations via API; results flow into a scorecard. Per-dimension scores normalize to 0 to 1 so dimensions compose.
- Calibration agreement against human-labeled examples is tracked over time; the platform alerts when agreement decays.
- CI integration calls the platform API; when a score regresses below its threshold, the CI step exits with a non-zero code and blocks the merge.
- Datasets, rubrics, and judges are versioned together; every evaluation run records the exact triple it scored under.
Your team writes the rubrics, contributes labels, and configures gates. The platform owns the storage, the calibration tracking, the drift detection, and the agreement metrics.
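On the CI side, the gate can be a thin wrapper around whatever the platform returns. The sketch below assumes a response body with `run`, `scores`, and `thresholds` fields; these field names and values are invented for illustration and do not reflect any vendor's actual schema. No network call is made here, only the decision logic is shown.

```python
import json

# Hypothetical response body from a platform's "run evaluation" endpoint.
EXAMPLE_RESPONSE = json.loads("""
{
  "run": {"rubric_version": "r3", "dataset_version": "d7", "judge_version": "j2"},
  "scores": {"faithfulness": 0.91, "format": 1.0, "tone": 0.78},
  "thresholds": {"faithfulness": 0.85, "format": 0.99, "tone": 0.80}
}
""")


def gate(response: dict) -> tuple[int, list[str]]:
    """Return (exit_code, messages); exit_code is non-zero if any dimension is below threshold."""
    run = response["run"]
    # Record the exact rubric/dataset/judge triple the run was scored under.
    triple = (run["rubric_version"], run["dataset_version"], run["judge_version"])
    messages = [f"scored under rubric/dataset/judge = {triple}"]
    failed = False
    for dim, minimum in response["thresholds"].items():
        score = response["scores"][dim]
        ok = score >= minimum
        failed = failed or not ok
        messages.append(f"{dim}: {score:.2f} (min {minimum:.2f}) {'ok' if ok else 'FAIL'}")
    return (1 if failed else 0), messages


exit_code, report = gate(EXAMPLE_RESPONSE)
```

In a CI script, the returned code would be passed to `sys.exit` so that a failing dimension blocks the merge, and the messages would be printed to the job log.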
Example
A team building a customer-facing summarization feature wants regression gates for faithfulness, format compliance, and tone.
Library path. The team imports a popular open-source library. They write a faithfulness check that uses an LLM judge, a format check that uses a JSON schema validator, and a tone check that uses a second judge. They store 200 examples in a directory, pinned in git. They write a pytest harness that runs the suite on every PR and posts a comment with per-dimension scores. To track judge calibration, they manually label 50 examples, run the judge against them, and compute the Matthews correlation coefficient (MCC). They schedule a quarterly recalibration task in their issue tracker. Time to first working gate: about two engineering weeks. Ongoing maintenance: roughly one engineer-day per month, spiking when an underlying judge model changes.
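The manual calibration step in the library path can be sketched as follows. It computes the Matthews correlation coefficient (MCC) between human labels and judge verdicts, assuming both are binary (1 = pass, 0 = fail). The label lists are made-up illustration data, far smaller than the 50 examples in the scenario.

```python
from math import sqrt


def mcc(human: list[int], judge: list[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = pass, 0 = fail)."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, judge))
    tp = sum(h == 1 and j == 1 for h, j in pairs)  # both say pass
    tn = sum(h == 0 and j == 0 for h, j in pairs)  # both say fail
    fp = sum(h == 0 and j == 1 for h, j in pairs)  # judge passes, human fails
    fn = sum(h == 1 and j == 0 for h, j in pairs)  # judge fails, human passes
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels: the judge disagrees with the human on two of eight examples.
human_labels = [1, 1, 1, 1, 0, 0, 0, 0]
judge_labels = [1, 1, 1, 0, 0, 0, 0, 1]
agreement = mcc(human_labels, judge_labels)  # 0.5 for this data
```

MCC ranges from -1 to 1, with 0 meaning the judge does no better than chance; it is less flattering than raw accuracy when pass and fail labels are imbalanced.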
Platform path. The team uploads a rubric for each of the three dimensions, uploads the same 200 examples as a versioned dataset, and configures three evaluators (one deterministic, two judges). The platform tracks calibration agreement automatically and alerts when it decays. CI integration is a single API call from the test runner; the response includes a per-dimension scorecard and a pass-or-fail decision. Time to first working gate: about two engineering days. Ongoing maintenance: rubric and label updates only; calibration tracking and drift detection are operational features of the platform.
Neither path is universally better. The library path is right when ML infrastructure exists, data residency is a hard constraint, or scorer internals must be customizable. The platform path is right when the team's binding constraint is shipping evaluation gates, not building them.
Comparison
| Capability | Open-source library | Managed platform |
|---|---|---|
| Scorer primitives | Built-in or via plugins | Built-in plus custom |
| Rubric versioning | Manual (git) | First-class, immutable, queryable |
| Dataset versioning | Manual (git or registry) | First-class, immutable, snapshot per evaluation run |
| Judge calibration tracking | Custom; you write the harness | Built-in agreement metrics against labels |
| Drift alerts on judge agreement | Custom | Built-in alerts and trend tracking |
| CI integration | Pytest, jest, custom runners | API call from any CI runner |
| Scorecard composition (0 to 1 normalized) | Custom aggregation code | First-class scorecards with weighted dimensions |
| Annotation queue and labeling tooling | External | Built-in or integrated |
| Multi-team rubric reuse | Copy-paste or internal package | Central registry with access control |
| Auditability and reproducibility | Strong (everything in git) | Strong (immutable versions, run records) |
| Data residency | You control it | Depends on deployment (hosted vs self-hosted option) |
| Cost model | Engineering time + judge inference | Per-evaluation pricing + judge inference |
| Upgrade cadence | You pin and upgrade | Platform-managed, with version pinning where offered |
Who should not use a managed evaluation platform
A managed platform is the wrong tier in a few specific cases.
- DIY data-science teams with deep ML infrastructure. Teams that already run their own model serving, dataset registries, and CI harnesses get more leverage from libraries they can extend freely.
- Strict on-premise constraints with no self-hosted option. Hard data-residency or air-gapped environments rule out hosted services. Use a library, or a platform that offers a self-hosted deployment.
- Research workflows requiring full scorer source control. When reproducibility down to a scorer commit hash matters more than operational features, pin a library in your dependency file.
- One-off bespoke metrics with no reuse. If every project has a unique metric implemented once and never updated, the managed lifecycle adds overhead without payoff.
Where each category is stronger
- Libraries are stronger for code-level reproducibility, deep customization, and tight integration with existing test frameworks. Their strength is "every artifact in git, every dependency pinned."
- Managed platforms are stronger for versioned objectives, calibration tracking, drift detection, multi-team rubric reuse, and CI gating as a first-class feature. Their strength is "the operational lifecycle is the product."
Hybrid setups are common: a library for fast iteration inside a single project, a platform for shared rubrics, calibration tracking, and release gates that span teams. The two are not mutually exclusive; the right composition follows from where ownership boundaries fall in the organization.
Limitations
- Libraries push operational cost onto the team. Calibration harnesses, drift alerts, dataset versioning, and judge lifecycle management all become work for your engineers. Underestimating that work is the most common failure mode.
- Platforms introduce vendor dependency. Pin the API contract, prefer exportable artifacts (rubrics and labels as plain files), and confirm self-hosted or migration paths before adopting.
- Both categories drift. Judges drift as underlying models change; datasets drift as user behavior shifts. Drift detection must be a first-class feature, whether you build it or buy it.
- Per-evaluation pricing scales with volume. At very high evaluation volumes, the per-run cost of a managed platform can exceed the engineering cost of a library. Model the cost curve before committing.
- Inline guardrails belong elsewhere. Neither libraries nor managed platforms are designed for sub-100 ms inline decisions in the synchronous path. Use deterministic checks or a small local classifier for that.
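The cost-curve point above can be modeled with a few lines of arithmetic. The sketch below assumes judge inference cost per run is the same on both paths, so it cancels out of the break-even calculation. Every figure (service fee, engineer-days, day rate) is a hypothetical placeholder to replace with your own numbers.

```python
def annual_cost_library(runs: float, inference_per_run: float,
                        engineer_days: float, day_cost: float) -> float:
    """Library path: judge inference plus the engineering time to run the infrastructure."""
    return runs * inference_per_run + engineer_days * day_cost


def annual_cost_platform(runs: float, inference_per_run: float, fee_per_run: float,
                         engineer_days: float, day_cost: float) -> float:
    """Platform path: judge inference plus a per-run service fee plus residual engineering time."""
    return runs * (inference_per_run + fee_per_run) + engineer_days * day_cost


def break_even_runs(fee_per_run: float, library_days: float,
                    platform_days: float, day_cost: float) -> float:
    """Yearly run volume above which the library path becomes cheaper."""
    return (library_days - platform_days) * day_cost / fee_per_run


# Hypothetical inputs: 0.02 per-run fee, 30 vs 6 engineer-days per year, 800 per day.
volume = break_even_runs(fee_per_run=0.02, library_days=30, platform_days=6, day_cost=800)
```

With these placeholder inputs the break-even lands at roughly 960,000 evaluation runs per year; below that volume the platform is cheaper under this model, above it the library is.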
Evidence and sources
- BIG-bench: a collaborative benchmark for evaluating LLM capabilities. github.com/google/BIG-bench
- HELM: a holistic framework for evaluating language models. crfm.stanford.edu/helm
- OpenAI evals: deterministic and graded eval harness. github.com/openai/evals
FAQ
Can a library and a platform coexist? Yes, and it is common. Many teams use a library for fast local iteration and a platform for shared rubrics, calibration tracking, and release gates. The boundary is usually ownership: who owns the rubric, who owns the calibration, who owns the gate.
How do I know if I need calibration tracking? If you run an LLM judge and you cannot answer "how well does this judge agree with humans on a known set," you need calibration tracking. Without it, judge scores are unfalsifiable.
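One simple way to track calibration over time is to compare recent agreement values against a baseline. The sketch below assumes you already have a history of agreement scores from periodic recalibration runs; the window size, tolerance, and history values are illustrative assumptions, not recommended settings.

```python
def drift_alert(history: list[float], baseline: float,
                tolerance: float = 0.05, window: int = 3) -> bool:
    """True when the mean of the last `window` agreement values is more than
    `tolerance` below the baseline agreement."""
    if len(history) < window:
        return False  # not enough data to judge a trend
    recent = history[-window:]
    return sum(recent) / window < baseline - tolerance


# Made-up agreement scores from six consecutive recalibration runs.
agreement_history = [0.90, 0.89, 0.88, 0.84, 0.80, 0.79]

early = drift_alert(agreement_history[:3], baseline=0.90)  # stable period
late = drift_alert(agreement_history, baseline=0.90)       # after decay
```

Here the early window stays within tolerance and raises no alert, while the later window averages about 0.81 against a 0.90 baseline and does.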
Is dataset versioning really necessary? Yes, once you change the dataset more than once. Without versioning, you cannot reproduce a past evaluation result or attribute a score regression to a dataset change versus a system change. Treat datasets as immutable snapshots with explicit version identifiers.
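A lightweight way to get explicit version identifiers is to derive them from the dataset content. The sketch below hashes a canonical JSON serialization with SHA-256; the field names in the examples and the 12-character truncation are arbitrary choices for illustration.

```python
import hashlib
import json


def dataset_version(examples: list[dict]) -> str:
    """Derive a version identifier from the dataset content itself."""
    # Canonical JSON (sorted keys, fixed separators) so equal content hashes equally.
    canonical = json.dumps(examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = dataset_version([{"input": "a", "expected": "A"}])
# Same content with a different key order yields the same identifier.
v1_again = dataset_version([{"expected": "A", "input": "a"}])
# Adding an example yields a different identifier.
v2 = dataset_version([
    {"input": "a", "expected": "A"},
    {"input": "b", "expected": "B"},
])
```

Recording this identifier with every evaluation run makes it possible to tell whether a score change coincided with a dataset change. Note that the order of examples affects the hash in this sketch, so sort them first if order should not matter.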
What about judge cost? Both categories incur LLM judge inference cost. Platforms add a service fee on top; libraries make you pay the inference cost directly. Compare total cost (inference + service + engineering time) over a year, not unit cost per run.
How does this relate to LLM observability platforms? Observability platforms answer "what happened." Evaluation libraries and managed evaluation platforms answer "is it correct against the rubric." The two categories compose: traces feed evaluation runs as candidate inputs, and evaluation results enrich trace metadata.