Framework-Coupled vs Framework-Agnostic Evaluation Platforms

Updated: 2026-02-15 By: Ari Heljakka

Short answer

Framework-coupled evaluation platforms are tightly bound to one orchestration library: they ingest its traces natively, mirror its abstractions, and ship the fastest first-day experience for teams already using that framework. Framework-agnostic platforms treat orchestration as out-of-scope: they read a neutral interface (OpenTelemetry spans delivered over OTLP, or a thin SDK), score against versioned objectives, and survive whatever orchestration layer the team adopts next. The choice is rarely about features in the abstract; it is about how much of your evaluation harness should survive a framework swap. If the answer is "all of it," go agnostic.

Key facts

  • Definition: A coupled platform is designed around a specific orchestration framework (its tracers, primitives, and concepts are first-class). An agnostic platform reads a neutral interface (e.g. OpenTelemetry) and is indifferent to the framework producing the traces.
  • When to use: Coupled for teams committed to one framework who want the lowest-friction integration. Agnostic for teams with multiple frameworks, custom orchestration, or a stated requirement that evaluation outlive any single tech choice.
  • Limitations: Coupled platforms get worse as you add a second framework. Agnostic platforms require more upfront instrumentation and lose framework-specific affordances unless the agnostic schema can carry them.
  • Example: A team that runs one orchestration library across its stack accepts coupling for speed. A team running two or three frameworks (and planning to add custom agent code) standardises on OpenTelemetry GenAI conventions and picks an agnostic evaluator layer.

Key takeaways

  • Coupling buys speed today and pays for it later: the cheaper the integration, the more of the evaluation harness is welded to one framework.
  • An evaluation harness should be portable across the things it evaluates. If the harness moves every time the framework moves, the harness is not the source of truth; the framework is.
  • Model agnosticism is the analogue of framework agnosticism. Both rest on the same principle: separate the objective from its implementation.
  • The cost of agnosticism is upfront instrumentation. The cost of coupling is migration friction every time the orchestration story changes.
  • Most teams underestimate how often the orchestration story changes. A two-year-old agent codebase rarely uses the same framework it started with.

Definition

Framework-coupled evaluation platforms treat one orchestration library as their primary input. The platform's data model mirrors the framework's primitives (its chains, graphs, runs, traces, datasets). Instrumentation is one import and one decorator; traces appear immediately; evaluators run natively against the framework's run objects. The price of that convenience is that anything outside the framework is a second-class citizen: imports get noisier, abstractions blur, and the agnostic path is rarely the recommended one.

Framework-agnostic evaluation platforms treat orchestration as out-of-scope. The platform consumes a neutral interface: typically OpenTelemetry GenAI spans, OTLP, or a small SDK that emits (input, output, metadata) tuples. The platform's data model is built around objectives, ground-truth datasets, and calibrated evaluators (none of which know which framework produced the trace). The platform survives any framework swap because it never depended on the framework in the first place.

The cleaner mental model is that an evaluation platform either centres on traces from a specific framework or centres on objectives that any framework can be scored against. The first is faster to start; the second is more durable to operate.
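To make the objective-centred model concrete, here is a minimal sketch in Python. The `Record` and `Objective` types, the `non_empty_answer` check, and the producer names are all hypothetical illustrations, not any platform's API; the point is only that the scoring function never looks at which framework produced the record.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Record:
    """A framework-neutral unit of evaluation: (input, output, metadata)."""
    input: str
    output: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Objective:
    """A versioned objective: a name, a version, and a scoring function."""
    name: str
    version: str
    score: Callable[[Record], float]  # expected to return a value in [0, 1]


def non_empty_answer(record: Record) -> float:
    # Toy rule check: 1.0 if the output contains any non-whitespace text.
    return 1.0 if record.output.strip() else 0.0


objective = Objective(name="answers_are_non_empty", version="1", score=non_empty_answer)

# Two records from different (hypothetical) producers. The objective reads
# only input/output, so the producer is irrelevant to the score.
from_framework = Record("What is OTLP?", "A wire protocol.", {"producer": "framework_a"})
from_custom = Record("What is OTLP?", "   ", {"producer": "custom_python"})

scores = [objective.score(r) for r in (from_framework, from_custom)]
print(scores)
```

A coupled design would instead hang the score off the framework's own run object, which is what makes the evaluator hard to reuse elsewhere.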

When this matters

The split matters once any of the following is true:

  • You run more than one orchestration framework. A coupled platform will give one framework the deep experience and treat the other as a manual integration.
  • You have custom agent code that does not use a framework. A coupled platform requires you to fake the framework's abstractions or fall back to its lowest-common-denominator SDK.
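  • You expect to swap frameworks. The half-life of orchestration frameworks in production is short. Coupled instrumentation, and the evaluation artefacts built on it, have to be reworked on every swap; with an agnostic setup only the code that emits spans changes, and the evaluation harness stays as it is.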
  • You want your evaluation criteria to outlive any single model or framework choice. This is the model-agnosticism principle applied to orchestration: the objective and ground-truth dataset should be stable while implementations rotate underneath.
  • You operate in a regulated context. Auditors want evidence that quality measurement is independent of the systems being measured. A platform welded to the runtime is harder to defend than one that reads a neutral schema.

If none of these hold, coupling is a reasonable tradeoff for speed.

How it works

Framework-coupled platforms

A typical coupled integration:

  1. Native instrumentation. Import the platform's wrapper for the orchestration library; framework runs are auto-traced; chain, agent, and tool calls become first-class entities.
  2. Framework-aware UI. The trace view renders the framework's abstractions directly: nodes, edges, state machines, retriever chains.
  3. Framework-tied dataset format. Datasets and evaluators are expressed in the framework's run schema; reusing them outside the framework requires translation.
  4. Calibration tools. Some coupled platforms ship dedicated calibration UIs, but the resulting calibration artefacts are typically also expressed in the framework's vocabulary.

The first-day experience is unbeatable. The third-year experience depends on whether the framework is still the right answer.

Framework-agnostic platforms

A typical agnostic integration:

  1. Standard instrumentation. Emit OpenTelemetry GenAI spans (or push OTLP) from the application, regardless of orchestration. Custom code, multiple frameworks, and home-grown agents all produce the same span schema.
  2. Objective-centric data model. The platform stores objectives, ground-truth datasets, and evaluator versions as first-class artefacts that reference traces by ID. The trace is an input to the score, not the home for it.
  3. Composable evaluators. Rule checks, LLM-as-judge, and reference-based metrics are versioned components addressable by API and reusable across frameworks.
  4. Dimensional decomposition. Complex objectives (e.g. "is the agent's response good") are decomposed into orthogonal dimensions (faithful, helpful, safe, format-conforming), each normalised to 0 to 1 and combined into a scorecard.
  5. Release gates. CI calls the same evaluator API regardless of which framework produced the trace; pass/fail decisions hinge on objective scores rather than on framework-specific signals.

Day-one friction is higher (someone has to wire OpenTelemetry properly). Year-three friction is lower (a framework swap touches the instrumentation, not the evaluation harness).
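Steps 4 and 5 of the agnostic integration can be sketched as follows. This is an illustration under stated assumptions: the weighted mean, the weights, the per-dimension floor on "safe", and the overall threshold are all hypothetical choices, since the text only says that dimensions are normalised to 0 to 1 and combined into a scorecard.

```python
from typing import Dict

# Hypothetical per-dimension scores for one trace, each normalised to [0, 1].
dimension_scores: Dict[str, float] = {
    "faithful": 0.9,
    "helpful": 0.8,
    "safe": 1.0,
    "format_conforming": 0.5,
}

# Assumed policy: weights for the combined score, plus hard floors per dimension.
weights: Dict[str, float] = {
    "faithful": 0.4,
    "helpful": 0.3,
    "safe": 0.2,
    "format_conforming": 0.1,
}
minimums: Dict[str, float] = {"safe": 0.95}
overall_threshold = 0.8


def scorecard(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine dimension scores into one number using a weighted mean."""
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is not normalised to [0, 1]: {value}")
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def release_gate(
    scores: Dict[str, float],
    weights: Dict[str, float],
    minimums: Dict[str, float],
    threshold: float,
) -> bool:
    """Pass only if every floor holds and the combined score clears the threshold."""
    floors_ok = all(scores[name] >= floor for name, floor in minimums.items())
    return floors_ok and scorecard(scores, weights) >= threshold


passed = release_gate(dimension_scores, weights, minimums, overall_threshold)
print("gate passed" if passed else "gate failed")
```

Nothing in the gate refers to a framework: a CI job could call it with scores computed from any producer's traces.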

The instrumentation seam

In both categories, the seam between application code and evaluation is the place worth designing carefully:

  • Coupled platforms put the seam inside the framework. If the framework changes, the seam moves.
  • Agnostic platforms put the seam at the wire format (OTLP / OpenTelemetry GenAI). If the framework changes, the seam stays.

That is the entire architectural argument compressed. Everything else follows.
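A rough sketch of a seam placed at the wire format, using only the standard library. The in-memory `EXPORTED` list stands in for a real OpenTelemetry SDK and OTLP exporter, `framework_callback` and its run dictionary are hypothetical, and the `app.*` attribute keys are invented extensions. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as they are understood at the time of writing; because those conventions are still evolving, check them against the current specification before relying on them.

```python
from typing import Any, Dict, List

# Stand-in for an OTLP exporter (assumption: real code would hand spans
# to an OpenTelemetry SDK rather than append them to a list).
EXPORTED: List[Dict[str, Any]] = []


def emit_genai_span(
    name: str,
    model: str,
    prompt: str,
    completion: str,
    input_tokens: int,
    output_tokens: int,
) -> Dict[str, Any]:
    """The seam: every producer goes through this one function."""
    span = {
        "name": name,
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            # Hypothetical extension attributes, not part of any standard.
            "app.input": prompt,
            "app.output": completion,
        },
    }
    EXPORTED.append(span)
    return span


def framework_callback(run: Dict[str, Any]) -> None:
    """Hypothetical hook an orchestration framework calls after each run."""
    emit_genai_span(
        "chat_assistant",
        run["llm"],
        run["inputs"]["question"],
        run["outputs"]["answer"],
        run["usage"]["prompt"],
        run["usage"]["completion"],
    )


def custom_agent(question: str) -> str:
    """Framework-free code that emits the same span shape."""
    answer = f"echo: {question}"
    emit_genai_span(
        "custom_agent", "local-model", question, answer,
        len(question.split()), len(answer.split()),
    )
    return answer


framework_callback({
    "llm": "model-a",
    "inputs": {"question": "hi"},
    "outputs": {"answer": "hello"},
    "usage": {"prompt": 1, "completion": 1},
})
custom_agent("hi there")

# Both producers yield the same attribute schema.
schemas = {frozenset(span["attributes"]) for span in EXPORTED}
print(len(schemas))
```

If the framework behind `framework_callback` is replaced, only that adapter changes; `emit_genai_span` and everything downstream of it stay put.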

Example

A team operating two LLM-powered surfaces:

  • A chat assistant built on one orchestration framework.
  • A retrieval pipeline built with a different orchestration library and some custom Python.

Framework-coupled path. The team picks an evaluator platform native to the chat assistant's framework. Chat traces appear instantly with rich annotations. The retrieval pipeline requires writing adapters to fake the framework's run schema; the team eventually gives up and runs a second, lighter eval workflow for retrieval. Six months later they migrate the chat assistant to a different framework for performance reasons; the evaluator integration is rewritten end-to-end.

Framework-agnostic path. The team standardises on OpenTelemetry GenAI conventions and instruments both surfaces to emit spans with the same fields (input, retrieved context, output, model, tokens, latency). The agnostic platform reads the same span schema from both. Datasets and evaluators are defined once and applied to both surfaces. When the chat assistant migrates frameworks, the spans are still the spans; the evaluation harness does not change. The cost is two weeks of careful instrumentation up front, paid once.
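The "defined once, applied to both surfaces" step might look like the sketch below. The `Span` type mirrors the fields named above (input, retrieved context, output, model, tokens, latency); the `grounded_in_context` check is a deliberately crude word-overlap rule chosen for illustration, and the sample spans are made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Span:
    surface: str            # which product surface emitted the span
    input: str
    retrieved_context: str
    output: str
    model: str
    tokens: int
    latency_ms: float


def grounded_in_context(span: Span) -> float:
    """Toy rule check: share of output words that appear in the retrieved context."""
    out_words = span.output.lower().split()
    ctx_words = set(span.retrieved_context.lower().split())
    if not out_words:
        return 0.0
    return sum(word in ctx_words for word in out_words) / len(out_words)


def score_by_surface(
    spans: List[Span], evaluator: Callable[[Span], float]
) -> Dict[str, float]:
    """Apply one evaluator to every span and average the scores per surface."""
    grouped: Dict[str, List[float]] = {}
    for span in spans:
        grouped.setdefault(span.surface, []).append(evaluator(span))
    return {surface: mean(values) for surface, values in grouped.items()}


# Made-up spans from the two surfaces in the example.
spans = [
    Span("chat", "capital of France?", "paris is the capital of france",
         "paris", "model-a", 12, 300.0),
    Span("retrieval", "capital of France?", "paris is the capital of france",
         "lyon", "model-b", 9, 120.0),
]

results = score_by_surface(spans, grounded_in_context)
print(results)
```

When the chat assistant changes frameworks, `grounded_in_context` and `score_by_surface` are untouched as long as the new producer fills the same fields.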

The two paths diverge most clearly at the migration boundary. The coupled path is faster until the first migration; the agnostic path is faster on every migration after the first.

Criterion-by-criterion view

| Criterion | Framework-coupled platforms | Framework-agnostic platforms |
| --- | --- | --- |
| First-day setup | Wins. One import, one decorator. | Loses. Requires explicit instrumentation. |
| Multi-framework support | Loses. One framework is first-class, others are second. | Wins. All frameworks produce the same span schema. |
| Custom-agent support | Loses. Custom code has to mimic the framework's abstractions. | Wins. Custom code emits spans like anything else. |
| Lock-in surface area | High. Datasets, evaluators, and dashboards reference framework primitives. | Low. Artefacts reference objectives and trace IDs. |
| Migration cost | High. A framework swap rewrites the instrumentation. | Low. A framework swap rewrites only the producer of spans. |
| Model agnosticism alignment | Partial. Often supports model swaps but ties them to framework runs. | Strong. Models are an implementation detail under a stable objective. |
| Vendor neutrality | Low. Tied to the framework's lifecycle and roadmap. | High. The wire format is an open standard. |
| CI/CD integration | Available, framework-shaped. | Available, objective-shaped. |
| Operational owner | Often the team that owns the framework choice. | Often a platform / quality team that owns evaluation across surfaces. |
| Day-one experience | Excellent for the dominant framework. | Adequate; requires deliberate setup. |
| Year-three experience | Degrades with each new framework added. | Stable across additions and migrations. |

The pattern is consistent: coupling buys speed where the framework choice is fixed; agnosticism buys durability where the framework choice will change. The right answer depends on whether the team can credibly commit to one framework for the lifetime of the evaluation harness.

Limitations

Who should not adopt a framework-coupled platform

  • Teams running more than one orchestration framework today.
  • Teams with substantial custom (non-framework) agent code.
  • Teams whose stated requirement is that quality measurement outlive any single tech choice.
  • Regulated teams whose auditors expect evaluation independence from the runtime under test.

Where each category is stronger

  • Coupled wins for teams committed to one framework, for the first-day experience, and for surfaces where the framework's abstractions are the natural unit of debugging.
  • Agnostic wins for multi-framework deployments, custom agent code, long-lived evaluation harnesses, and any case where the objective is supposed to be stable while implementations rotate.

Honest tradeoffs

Each approach carries costs that are easy to overlook:

  • Coupled platforms can hide cost. The integration is cheap until the second framework arrives. The hidden cost is the moment when the team has to choose between rewriting evaluation or letting one surface go un-scored.
  • Agnostic platforms can lose framework-specific affordances. A framework's first-class entities (its retriever chains, agent state machines, tool routers) often do not survive the trip through a generic OpenTelemetry span unless someone deliberately encodes them as span attributes. That work has to be done somewhere.
  • OpenTelemetry GenAI conventions are still evolving. Agnostic platforms that depend on the standard inherit its growing pains. A practical agnostic platform supports the standard plus a few stable extensions until the standard catches up.
  • Calibration is independent of either choice. Whether the platform is coupled or agnostic, the judge has to be calibrated against a labelled ground-truth set and re-calibrated when the model or policy changes. Neither category does this for you automatically.
  • Migration cost is unavoidable somewhere. Coupled platforms pay it at framework swap. Agnostic platforms pay it at instrumentation. The question is when and how often, not whether.
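The calibration point in the list above can be made measurable with a small check like the one below. It is a sketch under assumptions: the text does not prescribe a metric, so raw agreement and Cohen's kappa are used here as one common choice, and the labels are invented.

```python
from typing import List, Tuple


def agreement_and_kappa(human: List[str], judge: List[str]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) between human labels and judge verdicts."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    labels = set(human) | set(judge)
    expected = sum((human.count(label) / n) * (judge.count(label) / n) for label in labels)
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label throughout
    return observed, (observed - expected) / (1.0 - expected)


# Invented ground-truth labels and judge verdicts for eight examples.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]

observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Re-running the same check after a model or policy change is one way to notice that a judge needs re-calibrating, whichever kind of platform hosts it.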

FAQ

If I am all-in on one orchestration framework, do I lose anything by going agnostic? You lose a small amount of out-of-the-box convenience: the framework's abstractions will not auto-render in the eval platform's UI without some attribute mapping. You gain the ability to keep the same evaluation harness if you change your mind in a year. Many teams pick agnostic specifically to preserve that optionality, even when they have only one framework today.

Can a coupled platform also read OpenTelemetry? Often yes, but the OpenTelemetry path is rarely the recommended one. The native integration is what the product team optimises for; the standards path is what they maintain. That asymmetry is the operational risk.

Is framework agnosticism the same as model agnosticism? They are the same principle applied at different layers. Model agnosticism says the objective should outlive any single model. Framework agnosticism says the objective should outlive any single orchestration choice. Both rest on separating "what to measure" from "how the measurement is implemented."

What if my framework already has a built-in eval surface? That is the strongest form of coupling: the eval surface is part of the framework. It is fine for prototyping. It tends to be inadequate once you need calibrated judges, versioned datasets, dimensional decomposition, and CI gates. At that point, most teams add an external evaluator (coupled or agnostic) and the built-in surface becomes a debugging UI rather than the source of truth.

How do I migrate from coupled to agnostic without a forklift? Start by instrumenting OpenTelemetry alongside the coupled integration; both layers can read the same calls. Move objectives, datasets, and evaluators to the agnostic platform first, keeping the coupled platform as a debugging UI. Decommission the coupled integration only after the agnostic path has carried a full release cycle, including a calibration refresh.
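One way to picture the side-by-side phase is a wrapper that sends each call to both paths. This is a minimal sketch: the `traced` decorator and the two list-backed sinks are hypothetical stand-ins for a coupled platform's tracer and an OpenTelemetry exporter, not real integrations.

```python
from functools import wraps
from typing import Any, Callable, Dict, List

Sink = Callable[[Dict[str, Any]], None]

# Stand-ins for the two destinations during migration.
coupled_log: List[Dict[str, Any]] = []   # hypothetical coupled-platform tracer
neutral_log: List[Dict[str, Any]] = []   # hypothetical OpenTelemetry exporter


def traced(*sinks: Sink) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Wrap a function so every call is reported to each sink."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        @wraps(fn)
        def wrapper(prompt: str) -> str:
            output = fn(prompt)
            event = {"input": prompt, "output": output, "function": fn.__name__}
            for sink in sinks:
                sink(dict(event))  # each sink receives its own copy
            return output
        return wrapper
    return decorator


@traced(coupled_log.append, neutral_log.append)
def answer(prompt: str) -> str:
    # Placeholder for a real model or agent call.
    return prompt.upper()


answer("hello")
print(coupled_log == neutral_log)
```

Dropping the coupled sink from the decorator arguments is then the whole decommissioning step on the instrumentation side.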
