Updated: 2026-03-05 By: Ari Heljakka
Short answer
LLM evaluation tooling for AI agents falls into four archetypes that solve different problems and have different costs at scale. Evaluation-first platforms treat evaluators as versioned components and gate deployments on them. Framework-coupled tools live inside an agent framework and inherit its model of the world. Workbench tools optimize for prompt and dataset iteration, often offline. Observability-first platforms ingest production traces and bolt evaluators on top. Choosing the right archetype is more consequential than choosing the right vendor inside any archetype. The wrong archetype is hard to escape because each one rewards a different way of working day to day, with a different trace format, a different evaluator authoring path, and a different deployment model. Pick the archetype first, then the implementation.
Key facts
- Definition: A tool archetype is a category of LLM evaluation tooling defined by what it treats as the primary artifact (evaluator, framework run, prompt experiment, trace) and what kind of day-to-day workflow it imposes on the team.
- When to use: Whenever a team is choosing or replacing an evaluation tool and wants to avoid optimizing for the wrong category before vendor demos begin.
- Limitations: Real tools are not pure archetypes; most platforms occupy a primary archetype and reach into a secondary one. The archetype frame describes the centre of gravity, not the full feature surface.
- Example: A team that adopts an observability-first platform and then tries to make it the source of truth for evaluators usually ends up authoring evaluators in two places. The archetype mismatch is the cause; no amount of feature work fixes it.
Key takeaways
- The four archetypes are not interchangeable. Each rewards a different theory of where the evaluator lives and what the trace is for.
- Evaluation-first platforms treat the evaluator as the durable artifact. Observability-first platforms treat the trace as the durable artifact.
- Framework-coupled tooling is the cheapest option on day one and the most expensive when the framework is swapped.
- Workbenches are excellent for offline iteration and weak as production gates. Production gating needs the evaluator to run on every change and every sampled trace, with versioned lineage.
- The archetype choice constrains the trace format, the evaluator authoring path, and the deployment model. Pick it deliberately.
Definition
An LLM evaluation tool archetype is the centre of gravity of a tooling category: what the platform treats as the primary artifact, how that artifact is versioned, how it is composed, and what operational workflow it expects. The four archetypes below are not mutually exclusive in feature lists, but they are mutually exclusive in the shape they impose on the team that adopts them.
The frame matters because the cost of moving between archetypes is structural, not cosmetic. Evaluators authored in a workbench UI do not migrate to a code-first evaluation-first platform without rewriting. Traces emitted to a proprietary trace store do not transfer to an OpenTelemetry-native stack without translation. The archetype choice is the choice with the longest half-life.
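The translation cost can be made concrete with a minimal sketch. It assumes a hypothetical proprietary trace record (the field names `model`, `prompt_tokens`, `completion_tokens`, and `judge_score_v2` are invented for illustration) and maps it to attribute names from the OpenTelemetry GenAI semantic conventions. Those conventions are still evolving, so the target names should be checked against the current specification before use.

```python
from typing import Any, Dict

# Hypothetical proprietary field names mapped to OpenTelemetry GenAI
# attribute names. The left-hand side is invented for this sketch.
FIELD_MAP = {
    "model": "gen_ai.request.model",
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "completion_tokens": "gen_ai.usage.output_tokens",
}


def translate(record: Dict[str, Any]) -> Dict[str, Any]:
    """Split a proprietary record into mapped attributes and leftovers."""
    attributes: Dict[str, Any] = {}
    untranslated: Dict[str, Any] = {}
    for key, value in record.items():
        if key in FIELD_MAP:
            attributes[FIELD_MAP[key]] = value
        else:
            # No agreed target name: this is where the structural cost shows up.
            untranslated[key] = value
    return {"attributes": attributes, "untranslated": untranslated}


record = {
    "model": "example-model",
    "prompt_tokens": 120,
    "completion_tokens": 45,
    "judge_score_v2": 0.8,  # vendor-specific evaluator output
}
translated = translate(record)
```

In this sketch the usage fields map cleanly, while the vendor-specific `judge_score_v2` field lands in `untranslated`. Fields of that kind are the ones that need a deliberate decision during a migration.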
When this matters
The archetype frame becomes decisive when:
- A team is choosing its first serious evaluation tooling and the contract is multi-year.
- A team has outgrown its current tool and is deciding whether the next thing is a step inside the same archetype or a step into a different one.
- A team is operating two tools that occupy different archetypes and the cost of authoring evaluators twice has become visible.
- A buyer is comparing platforms with overlapping feature lists and needs a frame that survives the next vendor release.
If the team is doing throwaway prototypes, any archetype will do. Once evaluation outputs are gating deployments or feeding compliance reports, the archetype choice shapes the work for years.
How it works
Archetype 1: Evaluation-first platforms
The durable artifact is the evaluator: a versioned component that produces a normalized score against a defined dimension. Evaluators are first-class objects with versions, owners, calibration datasets, and lineage. The platform's primary workflow is "author an evaluator, calibrate it against a ground-truth dataset, run it as a gate on every change, monitor it on a sampled stream of production traces."
What it is good at:
- Treating evaluators as managed components with versioning, calibration, and lineage.
- Composing independent dimensions into a scorecard, with each dimension scored on a normalized 0-to-1 scale.
- Gating deployments on evaluator outputs and surfacing score drift as an operational signal.
- Cross-implementation portability: the same evaluator runs against any model, prompt, or framework.
What it is weaker at:
- Day-one velocity for teams that have not yet defined "good." It assumes the team can articulate dimensions.
- Visualizing arbitrary trace structure when the agent has unusual control flow.
- Being the only tool in the stack for teams that also need rich trace exploration.
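The evaluation-first workflow can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's API: the `Evaluator` class, the `scorecard` and `gate` helpers, the dimension names, and the thresholds are all hypothetical, and the toy scoring functions stand in for calibrated evaluators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Evaluator:
    """Hypothetical versioned evaluator: one dimension, one score in [0, 1]."""
    name: str
    version: str
    score_fn: Callable[[str], float]

    def score(self, output: str) -> float:
        # Clamp so every dimension lands on the same normalized 0-to-1 scale.
        return max(0.0, min(1.0, self.score_fn(output)))


def scorecard(evaluators: List[Evaluator], output: str) -> Dict[str, float]:
    # Key each score by name and version so results carry their lineage.
    return {f"{e.name}@{e.version}": e.score(output) for e in evaluators}


def gate(card: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    # A change passes only if every gated dimension meets its threshold.
    return all(card.get(key, 0.0) >= minimum for key, minimum in thresholds.items())


# Toy scoring functions standing in for calibrated evaluators.
conciseness = Evaluator("conciseness", "1.2.0", lambda out: 1.0 - len(out) / 200)
politeness = Evaluator(
    "politeness", "0.3.1", lambda out: 1.0 if "please" in out.lower() else 0.4
)

card = scorecard([conciseness, politeness], "Please restart the router.")
passed = gate(card, {"conciseness@1.2.0": 0.5, "politeness@0.3.1": 0.8})
```

Keying scores by `name@version` is one simple way to keep lineage attached to every result. The same `gate` function could run in CI on each change and on a sampled production stream.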
Archetype 2: Framework-coupled evaluation tools
The durable artifact is the framework run. Evaluation is a feature of the agent framework itself: built-in callbacks, prompt registries, and run dashboards live in the same code path as the agent. The platform's primary workflow is "develop in the framework, run the framework's evaluation feature, view results in the framework's dashboard."
What it is good at:
- Lowest day-one cost. The evaluation primitives are already there.
- Tight coupling to framework internals (chains, tools, retrievers) so evaluation can see what the agent is doing.
- Strong defaults for the framework's idiomatic patterns.
What it is weaker at:
- Framework portability. Evaluators authored against the framework's internals do not survive a framework swap or a multi-framework deployment.
- Multi-team adoption. The platform looks like the framework, which works for the team that uses it and not for the team next door.
- Audit posture. Evaluation lineage is usually entangled with the framework's run log, not stored as a separate, queryable artifact.
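The portability cost can be illustrated with a small sketch. `FrameworkARun` and `FrameworkBRun` are hypothetical run objects, not real libraries. The point is only that an evaluator written against one framework's internal shape fails on another, while an evaluator that takes plain data does not.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FrameworkARun:
    """Hypothetical run object of 'framework A' (not a real library)."""
    chain_steps: List[dict] = field(default_factory=list)


@dataclass
class FrameworkBRun:
    """Hypothetical run object of 'framework B' with a different shape."""
    events: List[Tuple[str, Optional[str]]] = field(default_factory=list)


def coupled_tool_call_count(run) -> int:
    # Coupled evaluator: reaches into framework A's internal step records.
    return sum(1 for step in run.chain_steps if step["type"] == "tool")


def portable_tool_call_count(tool_names: List[str]) -> int:
    # Portable evaluator: depends only on a plain list the caller extracts.
    return len(tool_names)


run_a = FrameworkARun(chain_steps=[{"type": "llm"}, {"type": "tool", "name": "search"}])
run_b = FrameworkBRun(events=[("llm", None), ("tool", "search")])

count_a = coupled_tool_call_count(run_a)

try:
    coupled_tool_call_count(run_b)
    survived_swap = True
except AttributeError:
    # Framework B has no `chain_steps`, so the coupled evaluator breaks.
    survived_swap = False

# The portable evaluator only needs a thin per-framework extraction step.
tools_b = [name for kind, name in run_b.events if kind == "tool" and name]
portable_count = portable_tool_call_count(tools_b)
```

The portable version still needs an adapter per framework, but the adapter is a few lines of extraction, and the evaluator logic itself is untouched by the swap.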
Archetype 3: Workbenches
The durable artifact is the experiment: a tuple of prompt, dataset, model, and resulting score. The platform's primary workflow is "load a dataset, define an evaluator, run a sweep across prompt or model variants, compare scores side by side." It is the experimental science instrument for prompt and model iteration.
What it is good at:
- Offline iteration on prompts and models against fixed datasets.
- Quick A/B comparisons across many variants.
- Surfacing the regression on a known dataset when a prompt changes.
What it is weaker at:
- Production gating. The workbench shape (load, run, compare) is interactive; a deployment gate needs to be unattended, versioned, and emit lineage.
- Continuous evaluation on a live stream. Most workbenches are not built to sit in the request path or in a sampling tap.
- Dimensional decomposition over time. Workbenches optimize for a single comparison surface, not for tracking many dimensions across many releases.
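The load, run, compare loop of a workbench might look like the following sketch. Everything here is an assumption made for illustration: `fake_model` is an offline stand-in for a real model call, `exact_match` is a deliberately crude evaluator, and the dataset and prompt variants are toy values.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (input, expected output)

CAPITALS = {"France": "Paris", "Japan": "Tokyo"}


def fake_model(prompt_template: str, country: str) -> str:
    # Stand-in for a model call so the sketch runs offline and deterministically.
    answer = CAPITALS[country]
    if "one word" in prompt_template:
        return answer
    return f"The capital of {country} is {answer}."


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected else 0.0


def run_sweep(
    variants: Dict[str, str],
    dataset: Dataset,
    model: Callable[[str, str], str],
    evaluator: Callable[[str, str], float],
) -> Dict[str, float]:
    # One mean score per prompt variant, all against the same fixed dataset.
    return {
        name: mean(evaluator(model(template, x), expected) for x, expected in dataset)
        for name, template in variants.items()
    }


dataset: Dataset = [("France", "Paris"), ("Japan", "Tokyo")]
variants = {
    "terse": "Answer in one word: what is the capital of {country}?",
    "verbose": "What is the capital of {country}?",
}

results = run_sweep(variants, dataset, fake_model, exact_match)
best = max(results, key=results.get)
```

The output is a single comparison surface, one number per variant. That is what makes the loop fast for iteration and also why it does not, on its own, provide the unattended, versioned run a deployment gate needs.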
Archetype 4: Observability-first platforms
The durable artifact is the trace. The platform ingests production traffic, persists structured spans, and provides query, search, and visualization over the trace store. Evaluation is added on top: a span attribute that carries an evaluator score, a sampled-and-scored view, an alert on a threshold.
What it is good at:
- Operating on the production stream at scale, with search, alerting, retention, and lineage at the trace level.
- Adopting OpenTelemetry GenAI semantic conventions to make instrumentation portable.
- Visualizing the agent's actual control flow on real traffic.
What it is weaker at:
- Treating evaluators as first-class artifacts. Evaluators are usually bolted-on transformations of the trace, not versioned components in their own right.
- Calibration workflow. The platform's centre of gravity is the span, not the evaluator-versus-ground-truth comparison.
- Acting as the source of truth for "what should be measured." It excels at "what was measured," which is a different question.
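The "evaluation added on top" pattern can be sketched with plain dictionaries standing in for spans. The attribute keys (`output.text`, `eval.groundedness.score`, `eval.groundedness.version`), the every-nth sampling rule, and the toy `has_citation` evaluator are hypothetical choices for this illustration. They are not taken from any platform or from the OpenTelemetry specification.

```python
from typing import Any, Callable, Dict, Iterable, List

Span = Dict[str, Any]


def score_sampled_spans(
    spans: Iterable[Span],
    evaluator: Callable[[str], float],
    evaluator_version: str,
    every_nth: int,
) -> List[Span]:
    """Score every nth span and attach the result as span attributes."""
    scored: List[Span] = []
    for index, span in enumerate(spans):
        if index % every_nth != 0:
            continue  # not in the sample
        attributes = dict(span["attributes"])  # copy; leave the input untouched
        attributes["eval.groundedness.score"] = evaluator(attributes["output.text"])
        attributes["eval.groundedness.version"] = evaluator_version
        scored.append({**span, "attributes": attributes})
    return scored


def below_threshold(scored: List[Span], key: str, threshold: float) -> List[str]:
    # Trace ids whose attached score should raise an alert.
    return [s["trace_id"] for s in scored if s["attributes"][key] < threshold]


def has_citation(text: str) -> float:
    # Toy evaluator: 1.0 if the answer cites a source marker, else 0.0.
    return 1.0 if "[source]" in text else 0.0


spans = [
    {"trace_id": "t0", "attributes": {"output.text": "Reset the device [source]."}},
    {"trace_id": "t1", "attributes": {"output.text": "Try again later."}},
    {"trace_id": "t2", "attributes": {"output.text": "It should work."}},
    {"trace_id": "t3", "attributes": {"output.text": "See the manual [source]."}},
]

scored = score_sampled_spans(spans, has_citation, "0.1.0", every_nth=2)
alerts = below_threshold(scored, "eval.groundedness.score", 0.5)
```

Here the evaluator version is just another attribute on the span. That records what was measured, but the evaluator itself still has no home with calibration data or lineage of its own, which is the gap described above.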
Example
A platform team is selecting tooling for three internal agent products. Each product has a different centre of gravity, and the archetype frame helps the team avoid one-size-fits-all procurement.
- Product A: a customer support agent under regulatory pressure, multi-year support window, audit obligations. The team chooses an evaluation-first platform as the source of truth and an OpenTelemetry-native trace store underneath it. Evaluators are versioned in code; lineage queries are part of the audit story.
- Product B: an internal research assistant inside a popular agent framework. The team uses the framework-coupled evaluation feature for day-to-day work and accepts that any future framework swap is a known cost.
- Product C: a prompt-iteration sandbox for the prompt-engineering team. The team uses a workbench tool to compare prompt variants against curated datasets and pushes the winning prompts into Product A's evaluator suite for production gating.
The team operates three archetypes consciously. They do not try to make any one tool be all three.
Side-by-side comparison
| Property | Evaluation-first | Framework-coupled | Workbench | Observability-first |
|---|---|---|---|---|
| Primary artifact | Evaluator | Framework run | Experiment | Trace |
| Versioning unit | Evaluator + dataset | Framework chain | Prompt + dataset | Span + attribute |
| Strongest at | Continuous gating | Day-one velocity | Offline iteration | Production search |
| Weakest at | Day-one velocity | Framework portability | Production gating | Evaluator as artifact |
| Audit posture | Strongest | Weakest | Weak | Strong on trace, weak on evaluator |
| Lock-in shape | Evaluator API | Framework internals | Workbench schema | Trace store |
| Best fit | Long-lived agent under audit | Single-framework deployment | Prompt-engineering loop | High-volume production traffic |
Who should not use a hosted evaluation-first platform
Teams whose product is one prompt against one model, who do not yet have a notion of dimensional decomposition, and who do not face audit or compliance pressure usually get more value from a workbench or a framework-coupled tool. The evaluation-first archetype rewards a team that has thought about what to measure.
Where each archetype is stronger
- Evaluation-first wins on long-lived agents, audit-grade lineage, and multi-framework deployments where the evaluator suite must survive the next framework rewrite.
- Framework-coupled wins on time-to-first-result for teams committed to one framework and willing to pay the swap cost later.
- Workbenches win on prompt iteration cadence and offline A/B comparison, especially in research and early-product phases.
- Observability-first wins on production traffic at scale, OpenTelemetry portability, and incident response, especially when paired with an evaluation-first source of truth.
Limitations
- Real tools mix archetypes. Most observability-first platforms ship evaluator features; most evaluation-first platforms ship trace ingestion. The archetype describes the centre of gravity, not the full surface.
- The right answer is often two tools. A workbench plus an evaluation-first platform, or an observability-first platform plus an evaluation-first source of truth, is a common shape.
- Archetypes evolve. The boundary between observability-first and evaluation-first is blurring as both sides adopt OpenTelemetry semantic conventions for evaluation spans.
- Pure feature comparison hides the archetype. Two platforms can both have an "evaluator" feature and require entirely different day-to-day workflows from the team.
- The wrong archetype is hard to leave. Migration cost is the cost of authoring evaluators or traces in a different shape; it is not a feature gap.
Evidence and sources
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, the shared schema that makes cross-archetype interoperability possible.
- "A Survey on LLM-as-a-Judge," 2024, https://arxiv.org/abs/2411.15594, for the calibration and dimensional decomposition concepts that evaluation-first platforms operationalize.
- NIST AI Risk Management Framework, https://www.nist.gov/itl/ai-risk-management-framework, for the audit-posture criteria that distinguish archetypes for regulated buyers.
FAQ
Can one tool be all four archetypes? Feature lists sometimes claim so; operational reality rarely supports it. The archetype is the centre of gravity, and centres of gravity do not stack.
Is OpenTelemetry support enough to make a tool evaluation-first? No. OpenTelemetry support is necessary for observability portability; it does not make evaluators first-class. An evaluation-first platform treats the evaluator as a versioned, calibrated artifact independently of the trace store.
Where do LLM judges fit? Across all four archetypes. The relevant question is how the judge is treated: as a managed, versioned evaluator (evaluation-first), as a callback inside the agent framework (framework-coupled), as a callable function inside a notebook (workbench), or as a transformation on the span stream (observability-first).
What if my team only needs offline experiments? A workbench is the right archetype. Adding an evaluation-first platform on top is overkill until the team is gating deployments on the results.
How does this interact with build versus buy? The archetype frame applies to both. A self-built evaluation-first stack on top of OpenTelemetry is a legitimate choice; so is a hosted workbench plus a hosted evaluation-first platform. The archetype is the decision; the build-versus-buy axis is orthogonal.