Updated: 2026-01-08 By: Ari Heljakka
Short answer
Framework-bundled observability (the trace viewer that ships with an orchestration library) is the right starting point for a single-framework prototype. Teams outgrow it when agents span multiple frameworks, when custom orchestration is added, when evaluation needs to gate deploys, or when audit and compliance require an artifact independent of the runtime. Beyond the framework, teams need three things the bundled tools rarely provide: a vendor-neutral trace store, versioned evaluators that survive a framework swap, and a single place where objectives live independent of any one orchestration library. The migration is not about replacing the framework; it is about decoupling the evaluation and observability harness from the framework.
Key facts
- Definition: Framework-bundled observability is the trace, replay, and evaluation surface shipped alongside an orchestration library. It is tightly coupled to that library's primitives and is fastest to use inside a single-framework codebase.
- When to use: Single framework, single team, single agent, mostly research or prototype. The bundled tool is enough.
- Limitations: Cross-framework traces, custom orchestration, deployment gating, audit lineage, and multi-team adoption all stress the coupling. The harder the problem, the more the framework coupling shows.
- Example: A team prototypes an agent inside one orchestration library, gets to production, then adds a second agent in a different framework. The bundled observability now covers one half of the system; the other half is dark.
Key takeaways
- Framework-bundled observability is a feature of the framework, not an independent system. Its data model, retention, and surface area follow the framework's roadmap.
- The cheapest integration is also the deepest coupling. Switching the orchestration library means switching the observability stack.
- Beyond the framework, teams need OpenTelemetry-style traces, versioned objectives, and managed evaluators that read traces from any source.
- Audit and compliance are the most common forcing functions. Bundled observability is rarely built for "show me the rubric version that gated this output."
- The migration path is incremental: standardise on OTLP, write objectives as artifacts, run evaluators against any trace source, then the bundled tool becomes one input among several.
Definition
Framework-bundled observability is the set of trace, replay, and (often) lightweight evaluation features that ship as part of an orchestration library or framework. Examples include the trace UI that accompanies popular chain-and-graph libraries, the run inspector tied to a specific agent framework, and the dataset comparator built into a workbench bundled with an SDK. The bundled tool is optimised for the framework that ships it: its data model mirrors that framework's primitives, its dashboards know about its abstractions, and its first-day experience inside that framework is excellent.
Framework-independent observability is the trace, evaluation, and gating layer that reads from any source. Its data model is built around objectives, scored samples, and (usually) OpenTelemetry GenAI spans. It is indifferent to which orchestration produced the trace, and survives a framework swap because it never depended on the framework in the first place.
The two are not exclusive. Many teams keep the bundled tool as the first stop for framework-native debugging while running an independent layer for evaluation, gating, and audit. The question is which one is the system of record.
When this matters
The bundled tool starts to creak when one or more of these conditions holds:
- More than one framework. A second agent in a different framework forces a choice: another bundled tool, or a layer above both. Two bundled tools mean two trace stores, two evaluator catalogues, two dashboards, two retention contracts.
- Custom orchestration. Bespoke agent code or a thin internal framework usually does not get a first-class bundled experience. Instrumentation gets noisy, abstractions blur, and the bundled tool's affordances no longer match the runtime.
- Deployment gating. Bundled tools usually surface scores as another dashboard column, not as a CI contract. Gating a deploy on a regression in a named dimension typically requires an external system holding versioned objectives.
- Audit and compliance. Regulated surfaces need to answer "which versioned rubric, judged by which versioned evaluator, produced the decision that gated this output." Bundled tools rarely treat the rubric and the judge as first-class versioned artifacts.
- Multi-team adoption. Once more than one team ships an agent, the observability layer becomes shared infrastructure. Coupling shared infrastructure to one framework concentrates organisational risk on that framework's choices.
- Long-lived agents. A two-year-old agent codebase rarely uses the same framework it started with. The observability layer that survives the migration is the one that did not depend on the framework.
If none of these holds, the bundled tool is fine. If any does, the cost of staying inside the framework starts to compound.
How it works
What the bundled tool gives you
A bundled observability surface typically includes:
- Native trace capture. Every framework primitive emits a span automatically. Instrumentation is one import, sometimes zero.
- Replay and inspection. Each run can be opened, walked, and re-executed against a different prompt or model.
- A run comparator. Two runs (or two prompts, two models) sit side by side with token, latency, and cost diffs.
- A lightweight evaluator catalogue. Often LLM-as-a-judge prompts wrapped in the framework's primitives, scored against the framework's runs.
- A dataset surface. Curated inputs that can be replayed and scored.
For a single-framework codebase, this surface is hard to beat. The data model is exact, the integration is invisible, and the iteration loop is short.
What the bundled tool does not give you
Most bundled surfaces stop at the framework's edge:
- Cross-framework traces. A second framework's spans are either invisible or appear as opaque blobs. The trace tree is incomplete across boundaries.
- First-class rubric versioning. The rubric is usually embedded in an evaluator definition. Querying "which version of which rubric produced this score" is not native.
- Judge versioning and calibration tracking. A judge is usually a named evaluator; its model, prompt, and threshold are not pinned as versioned artifacts, and its agreement with human labels is rarely a tracked metric.
- CI gating. Scores can flow to webhooks, but the contract for "this deploy is blocked because this dimension regressed against this dataset" is glue code.
- Audit lineage. A regulator asks for the rubric version, judge version, and dataset version behind a decision. The bundled tool answers with a trace ID and an evaluator name.
- Storage portability. Trace data lives in the bundled tool's store. Exporting it is a project; retention is bounded by the tool's contract.
These gaps are not flaws of any particular framework. They are properties of a tool whose primary user is the framework's developer, not an evaluation engineer or a compliance auditor.
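To make "agreement with human labels as a tracked metric" concrete, here is a minimal sketch that computes raw agreement and Cohen's kappa between a judge and human labels. The function name and the sample labels are illustrative assumptions, not part of any bundled tool; a real calibration set would be far larger.

```python
from collections import Counter


def agreement_and_kappa(judge_labels, human_labels):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; report 1.0 by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Made-up labels for eight samples, for illustration only.
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
observed, kappa = agreement_and_kappa(judge, human)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")  # agreement=0.75 kappa=0.47
```

Recording this pair of numbers each time the judge's model, prompt, or threshold changes is one way to turn calibration into a time series rather than a one-off check.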
The independent layer
Beyond the framework, the observability surface has three additional ingredients:
- A vendor-neutral trace store. OpenTelemetry GenAI semantic conventions give every framework a way to emit spans that the layer above can read uniformly. The trace store is no longer the framework's store; it is whatever the team chose, and the data is portable.
- Versioned objectives and managed evaluators. Each rubric is an artifact with a version, an owner, and a calibration dataset. Each evaluator is a managed component with a pinned model, prompt, and threshold. Running the evaluator against a trace produces a score with explicit lineage.
- A scorecard contract. A deployment is gated on a scorecard against a held-out dataset, scored by the managed evaluators. The contract is queryable: which rubric version, which evaluator version, which dataset version, which score.
The independent layer reads traces from the bundled tools as one input among many. The bundled tools continue to do what they do best (framework-native debugging) while the independent layer holds the system of record.
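One way to picture the second and third ingredients is as plain data. The sketch below is an assumption about shape, not a description of any product: the class names, fields, and sample values (`Rubric`, `Evaluator`, `Score`, `prompt_sha`, the version strings) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    name: str
    version: str
    owner: str
    calibration_dataset: str  # hypothetical id, e.g. "support-faithfulness-labels@3"


@dataclass(frozen=True)
class Evaluator:
    name: str
    version: str
    model: str        # pinned judge model identifier
    prompt_sha: str   # hash of the pinned judge prompt
    threshold: float  # minimum passing score


@dataclass(frozen=True)
class Score:
    trace_id: str
    value: float
    rubric: Rubric
    evaluator: Evaluator
    dataset_version: str

    @property
    def passed(self) -> bool:
        return self.value >= self.evaluator.threshold

    def lineage(self) -> dict:
        """Answer: which rubric, evaluator, and dataset version produced this score?"""
        return {
            "trace_id": self.trace_id,
            "rubric": f"{self.rubric.name}@{self.rubric.version}",
            "evaluator": f"{self.evaluator.name}@{self.evaluator.version}",
            "dataset": self.dataset_version,
            "score": self.value,
            "passed": self.passed,
        }


rubric = Rubric("faithfulness", "1.2.0", "support-eval-team", "support-faithfulness-labels@3")
judge = Evaluator("faithfulness-judge", "4", "judge-model-2025-10", "9f2c1e", 0.80)
score = Score("trace-abc123", 0.86, rubric, judge, "held-out-support@7")
print(score.lineage())
```

The design choice worth noting is that the score carries references to its rubric and evaluator versions, so the lineage question is a field lookup rather than a reconstruction from evaluator code.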
The migration path
Teams rarely migrate in a single jump. The usual path:
- Standardise on OpenTelemetry GenAI spans. Add OTLP export to the framework. The bundled tool keeps working; the spans now also flow to a vendor-neutral collector.
- Write objectives as artifacts. Move rubrics out of evaluator code and into versioned files. Each rubric gains an owner, a version, and a calibration dataset.
- Run evaluators against the OTLP stream. Pick managed evaluators whose model, prompt, and threshold are pinned. Score production traces and curated sets through the same evaluators.
- Gate CI on the scorecard. Introduce a deployment hook that blocks on regression against the curated set. The bundled tool's score view becomes informational; the gate is the contract.
- Move audit lineage into the independent layer. Lineage queries (rubric, evaluator, dataset) are answered by the layer that holds those artifacts, not by the trace store.
At each step the bundled tool keeps working. The migration is additive, not destructive.
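The gating step can be sketched as a small comparison between a baseline scorecard and a candidate scorecard on the curated set. This is a minimal illustration under assumed names: `find_regressions`, `gate`, the 0.02 tolerance, and the scores are all hypothetical, and a real pipeline would load both scorecards from the evaluation layer rather than hard-code them.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {dimension: (baseline, candidate)} for each dimension that dropped by more than tolerance."""
    regressions = {}
    for dimension, base_score in baseline.items():
        new_score = candidate.get(dimension)
        # A dimension missing from the candidate scorecard is treated as a regression.
        if new_score is None or base_score - new_score > tolerance:
            regressions[dimension] = (base_score, new_score)
    return regressions


def gate(baseline, candidate, tolerance=0.02):
    """Print each blocking dimension and return a process exit code (0 = pass, 1 = block)."""
    regressions = find_regressions(baseline, candidate, tolerance)
    for dimension, (old, new) in sorted(regressions.items()):
        print(f"BLOCKED: {dimension} regressed from {old} to {new}")
    return 1 if regressions else 0


# Illustrative scorecards; the numbers are made up.
baseline = {"faithfulness": 0.91, "policy_compliance": 0.97}
candidate = {"faithfulness": 0.85, "policy_compliance": 0.97}
exit_code = gate(baseline, candidate)
# In a CI job, this code would be passed to sys.exit so a non-zero value fails the build.
```

Returning an exit code instead of printing a warning is what separates a gate from a dashboard column: the deploy stops unless someone changes the scorecard or the tolerance on purpose.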
Example
A team ships a customer-support agent inside one orchestration framework. The bundled observability is great: every chain run is a trace, prompts and tool calls show up in a clean tree, a built-in evaluator scores faithfulness.
Six months later, three things happen:
- A second agent is built in a different framework for an internal use case.
- A regulator asks for evidence of policy compliance over the last quarter.
- A new prompt regression causes a 6 percent drop in faithfulness that nobody catches for two weeks, because the drop is only a number on the bundled dashboard and nothing blocks on it.
Each issue is a different gap in the bundled tool:
- Cross-framework: the second agent's traces are not in the bundled store. The team adds a second bundled tool. Now there are two stores, two evaluator catalogues, two retention contracts.
- Audit: the regulator wants the rubric version and judge version behind a sample of decisions. The bundled tool stored the score as a span attribute. The rubric lives in evaluator code. Reconstructing lineage takes a week.
- Gating: the faithfulness drop was visible in the dashboard but did not block deploy. The fix is a CI gate, which requires a scorecard against a held-out dataset, which the bundled tool does not natively produce.
The team's response is incremental: OTLP from both frameworks, rubrics moved into versioned files, managed evaluators run against the OTLP stream, scorecards against a curated set gating deploys. The bundled tools stay in place for framework-native debugging. The independent layer becomes the system of record for evaluation and audit.
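The first part of that response, getting both frameworks into one span shape, might look roughly like the sketch below. It is a standard-library stand-in rather than real OTLP export: in practice each framework would emit spans through an OpenTelemetry SDK and exporter. The two input record shapes are invented for illustration, and the attribute names are intended to follow the OpenTelemetry GenAI semantic conventions, which are still evolving, so check them against the current specification.

```python
def from_framework_a(run: dict) -> dict:
    """Map a hypothetical framework-A run record to a GenAI-style span dict."""
    return {
        "trace_id": run["run_id"],
        "name": run["chain_name"],
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": run["llm"]["model"],
            "gen_ai.usage.input_tokens": run["llm"]["prompt_tokens"],
            "gen_ai.usage.output_tokens": run["llm"]["completion_tokens"],
        },
    }


def from_framework_b(event: dict) -> dict:
    """Map a hypothetical framework-B event to the same span shape."""
    return {
        "trace_id": event["id"],
        "name": event["node"],
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": event["model_name"],
            "gen_ai.usage.input_tokens": event["usage"]["in"],
            "gen_ai.usage.output_tokens": event["usage"]["out"],
        },
    }


def total_tokens(span: dict) -> int:
    attrs = span["attributes"]
    return attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]


# Invented sample records, one per framework.
a_run = {"run_id": "a-001", "chain_name": "support_answer",
         "llm": {"model": "model-x", "prompt_tokens": 420, "completion_tokens": 96}}
b_event = {"id": "b-777", "node": "ticket_triage", "model_name": "model-y",
           "usage": {"in": 150, "out": 30}}

spans = [from_framework_a(a_run), from_framework_b(b_event)]
for span in spans:
    print(span["trace_id"], span["attributes"]["gen_ai.request.model"], total_tokens(span))
```

Once both agents produce the same attribute keys, an evaluator or a cost report can read either one without knowing which framework produced the trace.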
Comparison
Where each category is stronger
Framework-bundled observability is stronger when
- The codebase is committed to a single framework and likely to stay that way.
- Most failures are debugged by replaying a run in the framework's own primitives.
- Evaluation is sampled and informational, not a deployment gate.
- Audit needs are light and the trace ID is sufficient lineage.
- Team size is small and one tool is enough.
Independent observability is stronger when
- Multiple frameworks or custom orchestration are in play.
- Deployments are gated on a scorecard against a held-out dataset.
- Audit requires explicit lineage from decision to rubric and evaluator version.
- Multiple teams ship agents and need shared infrastructure.
- The lifetime of the agent is expected to outlive any single framework choice.
Most teams that ship production agents end up with both, used for different things.
Who should not use a hosted eval-first platform
A team running a single research agent inside one framework, with no production traffic and no audit obligations, will get more from the bundled tool than from a hosted eval-first platform. The hosted platform's value (versioned objectives, calibration tracking, scorecard gating) only materialises once there are objectives to version, scores to track, and deploys to gate. Without those, the hosted platform is a more expensive trace viewer.
Limitations
- Independent layers cost more to start. OTLP instrumentation, rubric files, calibration datasets, scorecards, gates: all of this is work that the bundled tool lets a team skip on day one. The cost is paid up front and amortised over the life of the agent.
- Bundled tools sometimes have better defaults inside their framework. A framework-native trace viewer often shows abstractions (chains, graphs, agent state) that the independent layer does not understand without extra schema. The fix is usually to keep the bundled viewer for inspection and let the independent layer handle scoring and gating.
- Calibration data is the hidden cost. Versioned rubrics and managed evaluators are only as good as their calibration datasets. Building and maintaining a labelled ground-truth set is the work that bundled tools quietly skip.
- Migration is gradual and incomplete. Most teams end up with the bundled tool, a vendor-neutral collector, and an independent evaluation layer running in parallel. The art is keeping the boundaries clean.
- An OTLP stream is not a scorecard. Standardising on OTLP solves portability but does not solve evaluation. Without managed evaluators reading the stream, the team still has only a trace store.
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, the shared schema that makes framework-independent trace capture possible.
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," Zheng et al., 2023, https://arxiv.org/abs/2306.05685, on why judge calibration must itself be a tracked metric over time.
- "Holistic Evaluation of Language Models," Liang et al., 2022, https://arxiv.org/abs/2211.09110, for the dimensional decomposition pattern that scorecards depend on.
FAQ
Is leaving the bundled tool the same as leaving the framework? No. The bundled tool and the framework can be decoupled: the framework keeps producing chains and graphs; the bundled tool keeps showing them; a separate trace export feeds the independent observability layer. The framework choice is unchanged.
If I am happy inside my framework, do I need an independent layer at all? If audit, gating, multi-framework support, and long lifetime are not concerns, no. The bundled tool is enough. The independent layer's value shows up when one of those concerns becomes critical.
Can the bundled tool grow into an independent layer? Some have. The question is whether the data model is centred on the framework or on objectives. A tool whose primitives are chains and runs has a long path to a tool whose primitives are objectives and scorecards.
Does standardising on OpenTelemetry solve the problem? It solves trace portability and does not solve evaluation. OTLP gives every framework a uniform export; the layer above OTLP still has to hold the rubrics, run the evaluators, and produce the scorecards.
What is the smallest first step away from the bundled tool? Move one rubric into a versioned file with an owner and a calibration dataset, run one managed evaluator against it, and add a single CI gate for that one dimension. The cost is small, the audit story is immediate, and the migration is unblocked.
How do I avoid running two evaluator catalogues in parallel? Pick one source of truth per concern: almost always the independent layer for objectives and the bundled tool for framework-native debugging. The bundled tool may keep cheap heuristics for ad-hoc checks; the independent layer holds the rubrics that gate deploys and feed audit.