Agent-First vs LLM-First Evaluation Platforms

Updated: 2026-01-04 By: Ari Heljakka

Short answer

Evaluation platforms split along a hidden architectural axis: what they treat as the native unit of evaluation. LLM-first platforms treat the unit as a single prompt-response pair; everything else (sessions, agents, tools) is bolted on top. Agent-first platforms treat the unit as a full trajectory of model calls, tool calls, and state transitions resolving one user goal; single-call scoring is a degenerate case of trajectory scoring. The choice is not a feature checklist; it determines the data model, the evaluator API, the dashboard, the CI gate, and how cost scales with traffic. Picking the wrong stance for your workload is the source of most "our eval platform does not fit our agent" pain.

Key facts

  • Definition: Agent-first platforms treat trajectories (sequences of model calls, tool calls, retries, side effects) as the unit of evaluation. LLM-first platforms treat single prompt-response pairs as the unit, with sessions modeled as collections of independent rows.
  • When to use: Agent-first when your system has tool calls, multi-turn state, or goal-level success criteria. LLM-first when your workload is single-call (classification, extraction, prompt iteration on a static dataset).
  • Limitations: Agent-first platforms cost more per scored unit and feel heavy for single-call workflows. LLM-first platforms hide multi-step failure modes (tool misuse, context loss, goal drift) and force teams to reimplement trajectory logic outside the platform.
  • Example: For a support agent that averages five model calls and three tool calls per session, an LLM-first platform scores each model call independently, while an agent-first platform scores the full trajectory against goal completion and per-step tool fidelity, producing one trustworthy aggregate plus per-dimension drill-down.

Key takeaways

  • The unit of evaluation is an architectural decision, not a feature toggle. It propagates through ingestion, storage, scoring, dashboards, and gates.
  • LLM-first platforms are optimal for static-dataset experiments and single-call workloads; they degrade gracefully into "spreadsheet with scores."
  • Agent-first platforms are optimal for trajectory-shaped workloads; they need richer ingestion and structured tool capture to earn their cost.
  • Most teams that picked the wrong stance discover it when they cannot answer "did the agent achieve the user's goal" from the platform's native data model.
  • The right cut is workload-driven, not vendor-driven. The same scoring primitives (a managed evaluator, a rubric, a ground-truth dataset) live in both categories; the surface they sit on is what differs.

Definition

An LLM-first evaluation platform is built around the assumption that the evaluable unit is a single prompt-response pair (or input-output row). The data model is a flat table of rows; sessions, threads, and agents are derived views over the flat table. Evaluators run per row. Dashboards aggregate row-level scores. The CI integration uploads a dataset of rows and gates on row-level pass rate. This stance excels on single-call workloads: classification, extraction, summarisation against a static dataset, and prompt-iteration loops where the comparison axis is "this prompt versus that prompt on the same N rows."

An agent-first evaluation platform is built around the assumption that the evaluable unit is a trajectory: a tree of spans capturing the user goal, every model call, every tool call (structured request and structured response), planner outputs, retries, and side effects. Sessions and multi-agent graphs are first-class data model concepts, not derived views. Evaluators are tagged by trajectory dimension (goal completion, tool-use correctness, context retention, reasoning quality) and run against the full trajectory. Dashboards aggregate by trajectory and by per-dimension score, not by row. The CI integration uploads a dataset of trajectories and gates on trajectory-level dimensional scores.

The same primitives appear in both categories: a managed evaluator, a versioned rubric, a ground-truth dataset, a 0 to 1 score. What changes is the structure of the data they operate on and the contract the platform offers to the system under evaluation.
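To make the two units concrete, here is a minimal sketch of both data models. The class and field names (`Row`, `Span`, `kind`, `payload`) are illustrative assumptions, not any platform's schema. It also shows why a single call can be read as a degenerate trajectory: a goal with exactly one model-call leaf.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Row:
    """LLM-first unit: one prompt-response pair (hypothetical fields)."""
    prompt: str
    response: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Span:
    """Agent-first unit: one node in a trajectory tree (hypothetical fields)."""
    kind: str  # e.g. "goal", "model_call", "tool_call"
    payload: dict[str, Any]
    children: list[Span] = field(default_factory=list)

    def walk(self):
        """Yield this span and every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


def row_as_trajectory(row: Row) -> Span:
    """Wrap a single call as a trajectory: a goal with one model-call leaf."""
    leaf = Span("model_call", {"prompt": row.prompt, "response": row.response})
    return Span("goal", {"text": row.prompt, **row.metadata}, [leaf])


trajectory = row_as_trajectory(Row("Classify: 'great product'", "positive"))
print([span.kind for span in trajectory.walk()])  # ['goal', 'model_call']
```

The conversion only works cleanly in this direction. Going from a trajectory back to rows discards the tree structure.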

When this matters

The agent-first versus LLM-first distinction becomes decisive when at least one of these holds:

  • Tool calls. Any system that invokes external tools needs structured request and response capture; flat row tables flatten this away.
  • Multi-turn state. Sessions where context from turn two affects behaviour at turn fifteen cannot be evaluated by row.
  • Goal-level success criteria. "Did the user's goal get accomplished" is a question only the trajectory can answer.
  • Multi-agent graphs. Evaluating handoff fidelity between cooperating agents requires per-edge scoring on the agent graph.
  • Tool-call regression detection. Regressions in tool-call correctness are your most common failure mode and you need to gate on them.

If none of those hold, the LLM-first stance is the right tool, and an agent-first platform will feel heavy and over-engineered. If even one holds, an LLM-first platform tends to require enough application-side glue code that the team ends up rebuilding trajectory primitives outside the platform.
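The rule of thumb above ("if even one holds") can be written down as a small checklist function. This is a sketch only. The `Workload` flags are hypothetical names for the five conditions in the list, and the function encodes the article's heuristic rather than any measured threshold.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical flags mirroring the five conditions listed above."""
    tool_calls: bool = False
    multi_turn_state: bool = False
    goal_level_criteria: bool = False
    multi_agent_graph: bool = False
    tool_call_regression_gating: bool = False


def recommend_stance(workload: Workload) -> tuple[str, list[str]]:
    """Return the suggested stance and the conditions that triggered it."""
    triggers = [name for name, holds in asdict(workload).items() if holds]
    return ("agent-first" if triggers else "LLM-first", triggers)


print(recommend_stance(Workload()))
# ('LLM-first', [])
print(recommend_stance(Workload(tool_calls=True, goal_level_criteria=True)))
# ('agent-first', ['tool_calls', 'goal_level_criteria'])
```

Returning the triggering conditions alongside the verdict keeps the recommendation auditable when a team revisits the decision later.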

How it works

LLM-first architecture

The native data model is a row: a single prompt-response (input-output) pair. Ingestion is row-shaped: upload a CSV, log a single call from the SDK, run a batch experiment over a static dataset. Evaluators are functions of one row at a time: deterministic checks (regex, schema, exact match), LLM-as-judge calls with the row as input, embedding similarity against a reference. Dashboards aggregate per-evaluator pass rate over a dataset run. The CI gate uploads a dataset version, runs the evaluator panel, and reports a percentage pass rate per evaluator.
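A minimal sketch of the row-shaped loop described above: each evaluator is a function of one row, and the report is a pass rate per evaluator. The column names (`input`, `output`, `expected`) and the three checks are assumptions chosen for illustration. A real panel would typically add an LLM-as-judge or embedding check alongside them.

```python
import json
import re
from typing import Callable

Row = dict[str, str]  # hypothetical columns: "input", "output", "expected"


def exact_match(row: Row) -> bool:
    """Deterministic check: output equals the reference, ignoring outer whitespace."""
    return row["output"].strip() == row["expected"].strip()


def is_valid_json(row: Row) -> bool:
    """Schema-style check: output parses as JSON."""
    try:
        json.loads(row["output"])
        return True
    except json.JSONDecodeError:
        return False


def no_email_leak(row: Row) -> bool:
    """Regex check: output contains nothing that looks like an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", row["output"]) is None


def pass_rates(rows: list[Row], evaluators: dict[str, Callable[[Row], bool]]) -> dict[str, float]:
    """Run every evaluator on every row and report the pass rate per evaluator."""
    return {
        name: sum(check(row) for row in rows) / len(rows)
        for name, check in evaluators.items()
    }


expected = '{"label": "positive"}'
rows = [
    {"input": "great product", "output": '{"label": "positive"}', "expected": expected},
    {"input": "love it", "output": '{"label": "negative"}', "expected": expected},
    {"input": "superb", "output": "label: positive", "expected": expected},
    {"input": "nice", "output": '{"label": "positive", "contact": "a@b.com"}', "expected": expected},
]
rates = pass_rates(
    rows,
    {"exact_match": exact_match, "valid_json": is_valid_json, "no_email_leak": no_email_leak},
)
print(rates)  # {'exact_match': 0.25, 'valid_json': 0.75, 'no_email_leak': 0.75}
```

Nothing in `pass_rates` knows which rows belong to the same session, which is exactly the property the failure modes below follow from.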

Failure modes specific to this stance:

  • Multi-turn sessions get flattened into independent rows, hiding context-loss failures.
  • Tool-call sequences appear as opaque text inside the output column, hiding tool-misuse failures.
  • Goal completion has no native representation; teams reconstruct it by concatenating rows and running a custom judge over the concatenation, often outside the platform's CI integration.
  • Multi-agent handoff fidelity cannot be expressed at all in the native data model.
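The goal-completion workaround in the list above usually looks something like the following sketch: group rows by session, concatenate them into a transcript, and run a custom judge over the result. The `session_id` and `turn` columns and the keyword-based `goal_judge` are assumptions for illustration. In practice the judge would more likely be an LLM call.

```python
from collections import defaultdict
from typing import Callable


def session_transcripts(rows: list[dict]) -> dict[str, str]:
    """Rebuild one transcript per session from independently logged rows."""
    by_session: dict[str, list[str]] = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["session_id"], r["turn"])):
        by_session[row["session_id"]].append(
            f"USER: {row['input']}\nASSISTANT: {row['output']}"
        )
    return {sid: "\n".join(turns) for sid, turns in by_session.items()}


def goal_pass_rate(rows: list[dict], goal_judge: Callable[[str], bool]) -> float:
    """Score goal completion per session, outside the row-level evaluator panel."""
    transcripts = session_transcripts(rows)
    return sum(goal_judge(t) for t in transcripts.values()) / len(transcripts)


rows = [
    {"session_id": "s1", "turn": 2, "input": "Please refund it.", "output": "Refund issued."},
    {"session_id": "s1", "turn": 1, "input": "Where is order A1?", "output": "I found order A1."},
    {"session_id": "s2", "turn": 1, "input": "Refund order B2.", "output": "I have submitted that for you."},
]
rate = goal_pass_rate(rows, goal_judge=lambda transcript: "Refund issued" in transcript)
print(rate)  # 0.5
```

Note what the transcript cannot recover: tool requests and responses are only present if they happened to be serialised into the `output` text.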

Agent-first architecture

The native data model is a trajectory: a tree of spans (often OpenTelemetry-shaped) with explicit roots for the user goal, branches for tool calls (structured request and structured response), and leaves for individual model calls and side effects. Ingestion is trajectory-shaped: an SDK or OpenTelemetry exporter captures the full tree at runtime; offline trajectories can be replayed from logs. Evaluators are tagged by dimension and run against the whole trajectory: goal completion (does the final state match the user intent), tool-use correctness (did the right tool get called with the right arguments), context retention (were earlier constraints honoured), reasoning quality (did each step change the visible task state). Dashboards aggregate by trajectory and per-dimension score; drill-down moves from "overall quality dropped" to "which dimension drove it" to "which span in the trajectory caused it." The CI gate uploads a dataset of full trajectories and gates on per-dimension floors, not just an aggregate.
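As a rough sketch of the flow just described, the code below scores whole trajectories with dimension-tagged evaluators and gates on per-dimension floors. The `Span` shape, the `final_state`/`intended_state` and `call`/`expected_call` fields, and the floor values are all assumptions for illustration. Only two of the four dimensions are implemented to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Callable


@dataclass
class Span:
    kind: str  # "goal", "model_call", or "tool_call"
    data: dict[str, Any]
    children: list[Span] = field(default_factory=list)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


def goal_completion(root: Span) -> float:
    """1.0 when the final state recorded on the root matches the intended one."""
    return float(root.data["final_state"] == root.data["intended_state"])


def tool_use_correctness(root: Span) -> float:
    """Fraction of tool calls whose tool and arguments match the expected call."""
    calls = [s for s in root.walk() if s.kind == "tool_call"]
    if not calls:
        return 1.0
    return mean([float(s.data["call"] == s.data["expected_call"]) for s in calls])


def gate(
    trajectories: list[Span],
    evaluators: dict[str, Callable[[Span], float]],
    floors: dict[str, float],
) -> tuple[dict[str, float], list[str]]:
    """Average each dimension over the dataset and list dimensions below their floor."""
    scores = {dim: mean([fn(t) for t in trajectories]) for dim, fn in evaluators.items()}
    failing = sorted(dim for dim, score in scores.items() if score < floors[dim])
    return scores, failing


refund_call = ("returns_api", {"order": "A1"})
tool_span = {"call": refund_call, "expected_call": refund_call}
ok = Span("goal", {"intended_state": "refunded", "final_state": "refunded"},
          [Span("tool_call", dict(tool_span))])
bad = Span("goal", {"intended_state": "refunded", "final_state": "pending"},
           [Span("tool_call", dict(tool_span))])

scores, failing = gate(
    [ok, bad],
    {"goal_completion": goal_completion, "tool_use_correctness": tool_use_correctness},
    floors={"goal_completion": 0.8, "tool_use_correctness": 0.9},
)
print(scores)   # {'goal_completion': 0.5, 'tool_use_correctness': 1.0}
print(failing)  # ['goal_completion']
```

A single averaged score here would be 0.75 and could clear a lenient aggregate threshold. The per-dimension floor is what makes the goal-completion drop block the build.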

Failure modes specific to this stance:

  • The richer ingestion contract is more work to wire up; teams that have no trajectory shape pay a setup cost for nothing.
  • Per-trajectory scoring on a frontier judge is more expensive than per-row scoring; sampling and tiered evaluation become operationally important.
  • Dashboards have more dimensions than the LLM-first equivalent and can feel overwhelming on small workloads.
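One way to contain the judge cost mentioned above is a two-tier pass: a cheap deterministic pre-filter decides obvious failures, and only a sample of the remaining trajectories reaches the expensive judge. The sketch below assumes a hypothetical trajectory dict with `id` and `tool_errors` keys, and uses a stub in place of a real judge call.

```python
import random
from typing import Callable


def tiered_scores(
    trajectories: list[dict],
    pre_filter: Callable[[dict], bool],
    judge: Callable[[dict], float],
    sample_rate: float,
    seed: int = 0,
) -> dict[str, float]:
    """Score trajectories in two tiers; unsampled trajectories are left unscored."""
    rng = random.Random(seed)  # seeded so a CI run is reproducible
    scores: dict[str, float] = {}
    for trajectory in trajectories:
        if not pre_filter(trajectory):
            scores[trajectory["id"]] = 0.0  # tier 1: cheap check failed, no judge call
        elif rng.random() < sample_rate:
            scores[trajectory["id"]] = judge(trajectory)  # tier 2: sampled for the judge
    return scores


judge_calls: list[str] = []


def stub_judge(trajectory: dict) -> float:
    """Stand-in for an expensive judge call; records which trajectories reach it."""
    judge_calls.append(trajectory["id"])
    return 0.9


batch = [
    {"id": "t1", "tool_errors": 0},
    {"id": "t2", "tool_errors": 2},
    {"id": "t3", "tool_errors": 0},
]
scores = tiered_scores(batch, lambda t: t["tool_errors"] == 0, stub_judge, sample_rate=1.0)
print(scores)       # {'t1': 0.9, 't2': 0.0, 't3': 0.9}
print(judge_calls)  # ['t1', 't3']
```

With `sample_rate=1.0` every trajectory that survives the pre-filter is judged. Lowering it trades coverage for cost, so the rate should be set from measured traffic rather than guessed.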

Example

A support agent for an e-commerce platform. Each user request resolves over an average of five model calls and three tool calls (CRM lookup, order status, returns API). Two evaluation stances on the same workload:

LLM-first. Each model call is logged as a row. The evaluator panel scores each row for instruction following, helpfulness, and toxicity. The dataset run reports 0.93 pass rate. Production users report that ten percent of refund requests fail silently. The evaluator panel has no concept of "did the refund get processed"; the failing trajectories all pass at the row level because each individual model response is fluent and on-topic. The team writes a custom judge that concatenates trajectory transcripts and scores goal completion outside the platform, then wires it into CI as a separate job.

Agent-first. Each user session is logged as a trajectory tree: user goal, five model calls, three tool calls (with structured request and response), planner outputs. The evaluator panel scores each trajectory on goal completion, tool-use correctness, context retention, and reasoning quality. The dataset run reports 0.93 on instruction-level metrics but 0.78 on goal completion. Drill-down shows the gap is concentrated on refund requests; per-trajectory inspection traces the failure to a CRM tool call returning an empty body that the planner treated as success. The platform's native data model surfaces the failure mode directly; CI gates on the per-dimension floor for goal completion catch the regression on the next deploy.

The objective in both cases is the same: prevent silent refund failures. The implementation surface that the platform exposes is what determines whether the team can express the objective natively or has to build glue around the platform.
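The specific failure in the agent-first walkthrough (a tool call that returns an empty body but is treated as success) is the kind of thing a deterministic per-span check can flag once tool responses are captured as structured data. A minimal sketch, assuming hypothetical span fields `kind`, `tool`, `status`, and `response`:

```python
def silent_tool_failures(spans: list[dict]) -> list[str]:
    """Return the tools whose calls were marked successful but returned an empty body."""
    return [
        span["tool"]
        for span in spans
        if span["kind"] == "tool_call"
        and span["status"] == "success"
        and not span["response"]
    ]


refund_session = [
    {"kind": "model_call", "prompt": "I want a refund for order A1", "response": "Checking."},
    {"kind": "tool_call", "tool": "order_status", "status": "success",
     "response": {"order": "A1", "state": "delivered"}},
    {"kind": "tool_call", "tool": "crm_lookup", "status": "success", "response": {}},
    {"kind": "model_call", "prompt": "...", "response": "Your refund has been processed."},
]
flagged = silent_tool_failures(refund_session)
print(flagged)  # ['crm_lookup']
```

The same check is not expressible over flat rows, because the tool response never exists there as a field to inspect.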

Comparison

A criteria-driven view, with the wins distributed across both stances:

| Criterion | LLM-first | Agent-first |
| --- | --- | --- |
| Native unit of evaluation | Single prompt-response row. | Full trajectory of spans. |
| Setup cost on single-call work | Low. Upload a CSV, score, done. | Higher. Trajectory ingestion is overkill for a flat dataset. |
| Setup cost on multi-turn work | High in practice. Trajectory logic gets rebuilt outside the platform. | Native. The data model fits the workload. |
| Tool-call evaluation | Loses. Tool calls appear as opaque text in the output column. | Wins. Structured request and response are first-class. |
| Goal-completion scoring | Possible only through application-side concatenation. | Native dimension. |
| Multi-agent handoff fidelity | Cannot be expressed in the native data model. | Native: per-edge scoring on the agent graph. |
| Prompt iteration UX | Wins. The comparison view is the headline feature. | Partial. Trajectory-shaped comparisons are richer but slower to skim. |
| Cost per scored unit | Wins. One judge call per row. | Loses. Trajectory-level judges and per-dimension scorers cost more. |
| CI gate granularity | Row-level pass rate. | Per-dimension floors and per-trajectory aggregates. |
| Drill-down on regressions | Limited. Per-row scores plus dataset diff. | Wins. Per-dimension to per-span trace. |
| Best fit workloads | Classification, extraction, summarisation, prompt iteration. | Tool-using agents, multi-turn assistants, multi-agent systems, goal-driven flows. |
| Composability of evaluators | Functions of one row. | Functions of a trajectory, with per-dimension composition. |

The pattern: LLM-first wins on lightweight single-call workflows where the prompt-response row is the right unit, and agent-first wins everywhere the workload is trajectory-structured and the team needs the platform's native data model to fit the system under evaluation.

Who should not use a hosted LLM-first platform

  • Teams with trajectory-shaped workloads (tool calls, multi-turn state, goal-completion criteria) whose hosted vendor only offers an LLM-first flat-row data model: the application-side glue grows into the most fragile part of the stack.
  • Teams whose CI gates need per-dimension floors rather than a single per-row pass rate: the LLM-first surface cannot natively express the gate.
  • Teams whose audit and lineage requirements include "which trajectory, scored on which dimension, was blocked" rather than "which row failed."

Where each category is stronger

LLM-first is the stronger fit when the workload is single-call (classification, extraction, summarisation, embedding-based retrieval scoring), when prompt iteration on a static dataset is the dominant developer activity, when there are no tool calls or tool calls are rare enough that ad-hoc handling is fine, and when the platform's CI integration only needs to gate on per-row evaluator pass rates.

Agent-first is the stronger fit when the workload includes tool calls whose structured arguments and responses must be scored, when multi-turn sessions are the norm and context loss is a real failure mode, when goal completion is the success criterion users actually care about and per-row scores hide it, when multi-agent graphs require per-edge handoff scoring, and when CI gates need per-dimension floors (tool-use, context retention, goal completion), not just one aggregate.

Limitations

  • The categories are stances, not strict typings. Some platforms straddle both stances and let teams pick the unit per workload. The architectural choice still shows up in defaults and in where the platform feels natural versus forced.
  • Agent-first cost compounds. Trajectory-level scoring on a frontier judge per session is expensive at production scale. Sampling, tiered evaluation, and cheap deterministic pre-filters are operationally mandatory.
  • LLM-first glue code is invisible. When teams build trajectory logic outside an LLM-first platform, the resulting system can look like "the platform plus a small custom layer," but the small custom layer often grows into the most fragile part of the stack.
  • Single-call evaluation does not disappear in an agent-first stance. Per-span scoring (one model call at a time) is still useful; agent-first platforms include it as a degenerate case of trajectory scoring.
  • Migration is non-trivial. Moving an agent workload from an LLM-first to an agent-first platform requires re-instrumenting the application to emit trajectory-shaped traces; existing flat-row datasets do not back-fill cleanly.

Evidence and sources

Numeric claims (cost ratios, regression detection rates) vary by workload and are not separately re-verified for this post; measure on your own traffic before using them in planning.

FAQ

Is one stance objectively better? No. Each is optimal for a different workload type. The pain comes from using the wrong stance for your workload; how well the other stance serves a different workload is not relevant to your decision.

Can I run an LLM-first platform on agent workloads? Yes, by building application-side glue that synthesises trajectory metrics outside the platform's native data model. The glue tends to become the most fragile part of the stack as the agent grows.

Can I run an agent-first platform on single-call workloads? Yes, but it will feel heavy. Single-call workloads are a degenerate case of trajectory scoring; the richer ingestion contract is pure cost on workloads that do not need it.

Where do hybrid platforms sit? On a spectrum. The architectural stance shows up in defaults, dashboards, and what "the score" means without explicit configuration. Hybrid platforms that lean LLM-first feel like spreadsheets on agent workloads; hybrid platforms that lean agent-first feel heavy on flat datasets.

Does my choice change if I move from prototyping to production? Often, yes. Prototyping favours LLM-first because the dataset-iteration loop is fast. Production agent workloads favour agent-first because the failure modes that matter at scale (tool misuse, context loss, goal drift) need trajectory-shaped data to detect.
