Updated: 2026-03-25 By: Ari Heljakka
Short answer
Prompt-centric platforms organize work around the prompt: edit it, branch it, deploy it, compare versions. Eval-centric platforms organize work around the score: define an objective, calibrate a judge against ground truth, gate deployments on a scorecard. The decision is not about which is "better" but about which workflow needs to be one click and which can tolerate being a configuration migration. Most teams want both, but the platform that sits at the center of the workflow shapes how the team thinks about quality.
Key facts
- Definition: Prompt-centric platforms treat the prompt as the central, deployable artifact with versioning and environments. Eval-centric platforms treat the objective and its evaluators as the central, versioned artifacts that score any implementation.
- When to use: Prompt-centric for fast prompt iteration, non-engineer authorship, and lightweight evaluation. Eval-centric for deployment gates, audit lineage, per-dimension drift detection, and continuous judge calibration.
- Limitations: Prompt-centric platforms usually treat evaluators as inline configuration without rubric versioning. Eval-centric platforms tend to have lighter prompt editing UIs and may require external prompt registries.
- Example: A team operating in a regulated industry uses a prompt-centric workflow for prompt drafting and an eval-centric system for the CI gate and audit trail.
Key takeaways
- Both categories share many surface features but differ on which artifact is the system of record.
- Prompt-centric platforms optimize for prompt iteration ergonomics.
- Eval-centric platforms optimize for versioned objectives, calibrated judges, and CI gating.
- The decision is shaped by what the team needs to prove, not just what they need to edit.
- Composing the two is the common pattern: a prompt-centric authoring surface paired with an eval-centric gate.
Definition
A prompt-centric platform organizes its data model and primary UI around the prompt. Prompts have IDs, versions, branches, environment bindings, and deployment workflows that resemble code release pipelines. Datasets, evaluators, and scores exist primarily to support prompt iteration: comparing version N+1 to version N, picking a winner, deploying it. The platform's center of gravity is "what is the current prompt running, who edited it, and how do I roll it back."
An eval-centric platform organizes its data model and primary UI around the score. Each objective is a versioned rubric backed by a ground truth dataset. Each evaluator (LLM judge, rule, classical metric) is a pinned implementation: model, prompt, threshold, version. Implementations of the objective (prompts, agent configurations, retrieval pipelines) are scored uniformly against the objective catalogue. The platform's center of gravity is "did the implementation meet the bar across the dimensions we care about, and which versioned evaluator produced the score that gated the last deployment."
The choice changes the unit of work. On a prompt-centric platform, the unit of work is the prompt version. On an eval-centric platform, the unit of work is the scored sample against a versioned objective.
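To make the eval-centric data model concrete, here is a minimal sketch of the three artifacts described above. The class and field names (`Objective`, `Evaluator`, `ScoredSample`, `lineage`) are hypothetical and chosen for illustration; they are not taken from any particular platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Objective:
    """A versioned rubric backed by a ground truth dataset (illustrative)."""
    name: str
    version: int
    rubric: str
    dataset_version: str


@dataclass(frozen=True)
class Evaluator:
    """A pinned implementation: model, prompt, threshold, version."""
    name: str
    version: int
    kind: str                # "llm_judge", "rule", or "classical_metric"
    model: Optional[str]     # None for rules and classical metrics
    prompt: Optional[str]    # None for rules and classical metrics
    threshold: float


@dataclass(frozen=True)
class ScoredSample:
    """The unit of work: one score tied to the versions that produced it."""
    sample_id: str
    score: float
    objective: Objective
    evaluator: Evaluator

    def lineage(self) -> dict:
        # Everything needed to say which versions produced this score.
        return {
            "objective": f"{self.objective.name}@v{self.objective.version}",
            "evaluator": f"{self.evaluator.name}@v{self.evaluator.version}",
            "dataset": self.objective.dataset_version,
        }
```

Freezing the dataclasses reflects the idea that a published version is not edited in place; a change produces a new version instead.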
When this matters
The architectural distinction starts to dominate when one or more of these conditions holds:
- Non-engineer authorship. When domain experts write and own prompts, a prompt-centric UI with diffs, comments, and approvals is the right tool. The eval system runs in support.
- CI gating on a scorecard. When every deployment must clear a multi-dimensional scorecard with explicit thresholds, an eval-centric system holds those gates as native primitives.
- Audit-grade lineage. When the question "which versioned objective, judged by which evaluator version, against which dataset version, produced the score that gated this output" must be answerable, an eval-centric system models that lineage as a first-class concern.
- Per-dimension drift. When toxicity, factuality, and instruction following must each have their own threshold and alert, slicing dashboards by prompt version is not enough; you need to slice by objective and dimension.
- Continuous calibration. When judge agreement against human-labeled ground truth must be tracked and recalibrated on a cadence, the evaluator needs to be a managed component with its own version history.
- Multiple implementations per objective. When the same objective is enforced by rules in front, an LLM judge behind, and human review at the edge, the evaluator catalogue must be decoupled from any one prompt. That decoupling is native to eval-centric systems.
If the dominant work is "ship a new prompt today, safely," prompt-centric wins. If the dominant work is "prove the bar holds across changes," eval-centric wins.
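The per-dimension drift condition above can be sketched as a comparison of recent scores against a baseline, one dimension at a time. This is a minimal illustration; the function name `drifted_dimensions`, the mean-based comparison, and the `max_drop` tolerance are assumptions, not a prescribed method.

```python
from statistics import mean


def drifted_dimensions(baseline, recent, max_drop=0.05):
    """Return the dimensions whose mean score fell by more than max_drop.

    baseline and recent map a dimension name to a list of scores in [0, 1].
    """
    alerts = {}
    for dimension, base_scores in baseline.items():
        drop = mean(base_scores) - mean(recent[dimension])
        if drop > max_drop:
            alerts[dimension] = drop
    return alerts


baseline = {"toxicity": [0.98, 0.99], "factuality": [0.90, 0.92]}
recent = {"toxicity": [0.98, 0.98], "factuality": [0.80, 0.82]}
alerts = drifted_dimensions(baseline, recent)  # only "factuality" is flagged
```

A single blended score over both dimensions would dilute the factuality drop, which is why the alert is keyed by dimension.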
How it works
Prompt-centric
A typical pipeline:
- Prompt registry. Prompts have IDs, versions, branches, and environment bindings.
- Editor and collaboration. A web UI lets non-engineers edit prompts, leave comments, and request review. Version diffs are first-class.
- Deployment. Promoting a prompt is a versioned action with audit log and rollback.
- Inline evaluation. Evaluators score outputs against test inputs. Comparisons across prompt versions are the primary surface. Evaluator definitions are typically configuration on the prompt or dataset, not standalone versioned objects.
- Production telemetry. Live calls are logged and can be re-scored against candidate prompts.
The center of gravity is the prompt. Deployment ergonomics are mature; evaluation infrastructure is functional but secondary.
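The registry, deployment, and rollback steps above can be sketched as a small in-memory class. All names here (`PromptRegistry`, `commit`, `promote`, `rollback`, `current`) are hypothetical; a real platform would persist this state and attach authorship and review metadata.

```python
class PromptRegistry:
    """Illustrative registry: versions, environment bindings, audit log."""

    def __init__(self):
        self.versions = {}   # prompt_id -> list of texts; version = index + 1
        self.history = {}    # (prompt_id, env) -> versions deployed, in order
        self.audit_log = []  # (action, prompt_id, env, version)

    def commit(self, prompt_id, text):
        texts = self.versions.setdefault(prompt_id, [])
        texts.append(text)
        return len(texts)  # the new version number

    def promote(self, prompt_id, env, version):
        if not 1 <= version <= len(self.versions.get(prompt_id, [])):
            raise ValueError(f"unknown version {version} for {prompt_id}")
        self.history.setdefault((prompt_id, env), []).append(version)
        self.audit_log.append(("promote", prompt_id, env, version))

    def rollback(self, prompt_id, env):
        deployed = self.history.get((prompt_id, env), [])
        if len(deployed) < 2:
            raise ValueError("no earlier deployment to roll back to")
        deployed.pop()  # drop the current binding
        self.audit_log.append(("rollback", prompt_id, env, deployed[-1]))
        return deployed[-1]

    def current(self, prompt_id, env):
        version = self.history[(prompt_id, env)][-1]
        return self.versions[prompt_id][version - 1]
```

Note that nothing in this sketch knows about evaluators or scores, which mirrors the point that evaluation is secondary in this architecture.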
Eval-centric
A typical pipeline:
- Objectives. Versioned rubrics with ground truth datasets. Each objective is independent of any specific implementation.
- Managed evaluators. Each objective has one or more evaluators (LLM judge, rule, classical metric). Each evaluator is pinned: model, prompt, threshold, version.
- Scorecards. Scored samples carry explicit lineage to objective version, evaluator version, and dataset version.
- Gates and alerts. CI deployments are gated by scorecard on a held-out set. Drift on any dimension triggers an alert tied to the specific objective version.
- Calibration loop. Judge agreement with human-labeled ground truth is tracked over time. Recalibration is triggered when agreement drops below threshold.
- Prompt iteration as input. Prompts (held in an external registry or version control) are scored against the objective catalogue before promotion.
The center of gravity is the objective. Audit and gating are mature; prompt editing is usually lighter.
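The calibration loop in the pipeline above can be sketched as an agreement check between judge verdicts and human labels. This is a minimal illustration assuming categorical labels and raw percent agreement; the function names and the 0.85 default are hypothetical.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of samples where the judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def needs_recalibration(judge_labels, human_labels, min_agreement=0.85):
    """True when agreement has dropped below the configured threshold."""
    return agreement_rate(judge_labels, human_labels) < min_agreement


judge = ["pass", "pass", "fail", "pass"]
human = ["pass", "fail", "fail", "pass"]
rate = agreement_rate(judge, human)          # 3 of 4 match: 0.75
flagged = needs_recalibration(judge, human)  # True at the 0.85 default
```

Raw agreement is the simplest choice. A chance-corrected statistic such as Cohen's kappa may be a better fit when one label dominates the dataset.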
Where they overlap
Both can edit prompts, both can hold datasets, both can run evaluators. The difference is which workflow is one click. A prompt-centric platform makes "ship a new prompt version" trivial. An eval-centric platform makes "ship a new rubric version and retroactively rescore" trivial.
Example
A team building an AI assistant for healthcare summarization:
- Prompt-centric slice. Clinical experts draft prompts in a hosted UI with diff view, comments, and an approval workflow. Each prompt version has a one-click rollback. Non-engineer ownership of prompt content is the headline feature.
- Eval-centric slice. A separate system holds versioned rubrics for "summary preserves all medication dosages," "summary flags ambiguity rather than fabricating values," "summary respects patient confidentiality." Each rubric has a ground truth dataset built from labeled clinical cases. Managed LLM judges score outputs against each rubric; judge agreement with clinical reviewers is tracked monthly. A CI gate blocks any prompt promotion that fails to clear the scorecard on a held-out set.
- Where each pays off. The prompt-centric surface makes editing fast and safe for non-engineer authors. The eval-centric surface makes the deployment gate explicit, with audit-grade lineage from any decision back to the rubric version and evaluator version that produced its score.
The two meet at the CI step: a candidate prompt from the registry is scored against the objective catalogue before promotion.
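As a rough sketch of that CI step, the gate below averages held-out scores per rubric and blocks promotion if any rubric falls under its threshold. The dimension names, threshold values, and function names are illustrative assumptions, not values from a real deployment.

```python
from statistics import mean


def build_scorecard(samples):
    """Average (dimension, score) pairs from the held-out set per dimension."""
    by_dimension = {}
    for dimension, score in samples:
        by_dimension.setdefault(dimension, []).append(score)
    return {d: mean(scores) for d, scores in by_dimension.items()}


def failing_dimensions(scorecard, thresholds):
    """Dimensions that are missing from the scorecard or below their minimum."""
    return [
        dimension
        for dimension, minimum in thresholds.items()
        if dimension not in scorecard or scorecard[dimension] < minimum
    ]


# Hypothetical thresholds for the three healthcare rubrics.
THRESHOLDS = {
    "dosage_preserved": 0.95,
    "ambiguity_flagged": 0.85,
    "confidentiality": 0.99,
}

held_out = [
    ("dosage_preserved", 1.0), ("dosage_preserved", 1.0),
    ("ambiguity_flagged", 0.9), ("ambiguity_flagged", 0.7),
]
failures = failing_dimensions(build_scorecard(held_out), THRESHOLDS)
promotion_allowed = not failures  # a CI job would exit non-zero when False
```

Treating an unscored dimension as a failure is a deliberate choice in this sketch: a rubric that was never run should not count as passed.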
Comparison
A category-level view, with the wins distributed across both:
| Criterion | Prompt-centric | Eval-centric |
|---|---|---|
| Unit of work | The versioned prompt. | The scored sample against a versioned objective. |
| Source of truth | Which prompt is running where. | Whether the implementation meets the success criteria. |
| Prompt editing UX | Native, often the headline feature. | Lighter; relies on external registry or VCS. |
| Non-engineer authorship | Native: branches, comments, approvals. | Possible, usually less polished. |
| Rollback | One click, audit-logged. | Via version control on the prompt artifact. |
| Rubric versioning | Often inline configuration. | First-class versioned artifact. |
| Judge versioning | One evaluator definition per name. | Pinned model, prompt, threshold; each version queryable. |
| CI gating on scorecard | Possible with glue. | Native primitive. |
| Per-dimension drift alerts | Slice by prompt version. | Slice by objective and dimension. |
| Calibration tracking | Limited. | Judge agreement against ground truth is a tracked metric. |
| Audit lineage | Prompt version plus environment. | Objective version + evaluator version + dataset version + score. |
| Multi-implementation support | Tied to prompt iteration. | Scores rules, LLM judges, human review uniformly. |
| Score composability | Per-prompt-version metric. | 0 to 1 across orthogonal dimensions; weighted aggregates. |
The pattern: prompt-centric wins on edit speed, non-engineer collaboration, and rollout ergonomics. Eval-centric wins on rubric and judge versioning, CI gating, audit lineage, per-dimension drift detection, and multi-implementation support.
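The "weighted aggregates" entry in the score composability row can be illustrated with a short sketch. It assumes every dimension is already scored on a 0 to 1 scale; the function name and the weights are hypothetical.

```python
def weighted_aggregate(scores, weights):
    """Weighted mean of per-dimension scores, each expected in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for dimension, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{dimension} score {score} is outside [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in weights) / total_weight


scores = {"factuality": 1.0, "toxicity": 0.5}
weights = {"factuality": 1.0, "toxicity": 1.0}
overall = weighted_aggregate(scores, weights)  # 0.75
```

An aggregate like this is convenient for dashboards, but it can mask a single failing dimension, so it complements per-dimension thresholds rather than replacing them.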
Prompt-centric plays well when
- Non-engineers own and edit prompts frequently.
- The dominant operational concern is "ship the next prompt safely."
- Evaluation is a comparison tool ("v14 vs v13"), not a deployment gate.
- The team accepts inline rubric configuration without first-class versioning.
Eval-centric plays well when
- Deployments must clear a multi-dimensional scorecard with audit-grade lineage.
- Per-dimension drift is a first-class operational signal.
- The same objective is enforced by multiple implementations and must remain stable across model swaps.
- Judge calibration against human-labeled ground truth is a recurring operational task.
- Compliance, audit, or regulated surfaces force explicit lineage from decision to rubric version.
Three questions that resolve the choice
- Who edits prompts most often? If the answer is non-engineers and the cadence is daily, the prompt-centric surface earns its keep. If the answer is engineers and the cadence is weekly, an external prompt registry plus an eval-centric platform is often enough.
- What blocks a deployment? If the answer is a peer review and a quick sanity check, prompt-centric ergonomics dominate. If the answer is a multi-dimensional scorecard with versioned thresholds, an eval-centric gate dominates.
- What does an auditor ask for? If the answer is "show me the diff and the deploy log," prompt-centric is sufficient. If the answer is "show me the score, the rubric version, the evaluator version, the dataset version, and the agreement metric at time of decision," eval-centric is required.
In practice, most production stacks compose the two: a prompt-centric authoring surface for non-engineer collaborators, paired with an eval-centric system for gates, calibration, and audit.
Limitations
Both categories have real soft spots:
- Prompt-centric platforms hide rubric drift. When rubrics live as inline evaluator config, changing them is easy, and the effect on historical scores is easy to miss. The bar can move invisibly.
- Prompt-centric gates are coarse. "v14 beats v13 on this dataset" is useful but does not enforce a scorecard across orthogonal dimensions with versioned thresholds.
- Eval-centric editing is lighter. Teams that need non-engineer prompt authoring usually pair with an external prompt registry or code-based workflow.
- Both depend on the ground truth dataset. A stale dataset means scoring against a stale bar; dataset versioning and refresh cadence are operational concerns in either architecture.
- Coupling objective to implementation is easy to do by accident. A score named "factuality" produced by one specific judge model is not the same artifact as a versioned objective. The naming hides the coupling until the model changes.
- Two sources of truth invite confusion. If a prompt-centric UI shows one score and an eval-centric system shows another for the same input, the team needs a clear rule for which one is authoritative.
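One way to make the rubric drift described in this list visible, even when evaluators live as inline configuration, is to store a fingerprint of the evaluator config next to every score. The sketch below is an assumption about how a team might do this, not a feature of any specific platform; `evaluator_fingerprint` and `comparable` are hypothetical names.

```python
import hashlib
import json


def evaluator_fingerprint(config):
    """Short, stable hash of an evaluator config (model, prompt, threshold)."""
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(score_a, score_b):
    """Two scores are comparable only if the same evaluator config made them."""
    return score_a["fingerprint"] == score_b["fingerprint"]


old_config = {"model": "judge-model-a", "prompt": "Is it factual?", "threshold": 0.8}
new_config = {"model": "judge-model-a", "prompt": "Is it factual?", "threshold": 0.7}

score_v13 = {"value": 0.81, "fingerprint": evaluator_fingerprint(old_config)}
score_v14 = {"value": 0.84, "fingerprint": evaluator_fingerprint(new_config)}
same_bar = comparable(score_v13, score_v14)  # False: the threshold changed
```

With the fingerprint recorded, a "v14 beats v13" comparison can be rejected or flagged when the two scores were produced under different evaluator settings.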
Evidence and sources
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," Zheng et al., 2023, https://arxiv.org/abs/2306.05685, for the foundational case that judge quality must itself be measured.
- "Holistic Evaluation of Language Models," Liang et al., 2022, https://arxiv.org/abs/2211.09110, for the dimensional decomposition pattern underlying multi-objective scorecards.
- OpenTelemetry GenAI semantic conventions, https://opentelemetry.io/docs/specs/semconv/gen-ai/, for the trace shape that both categories consume.
FAQ
Is one architecture replacing the other? No, not in mature stacks. Prompt-centric platforms keep winning on edit ergonomics; eval-centric platforms keep winning on gates, calibration, and audit. Composing the two is the common pattern.
Can I build my own eval-centric layer on top of a prompt-centric platform? You can. The work is real: rubric versioning as first-class data, evaluator pinning (model, prompt, threshold), dataset versioning, CI gates that consume a scorecard, and per-dimension drift alerts. If most of that is bespoke glue, you have effectively rebuilt an eval-centric platform.
Does eval-centric assume LLM judges? No. The evaluator catalogue holds rules, classical metrics, and human review alongside LLM judges. The point is that each evaluator is a pinned, versioned implementation of an objective.
How do I avoid two sources of truth? Pick which platform owns the deployment gate. Treat scores from the other system as informational. Make the authoritative scorecard the one that blocks the deploy, and make sure every other dashboard derives from it.
Does this discussion change for agents vs single-turn prompts? The category distinction is the same. What changes is the granularity: eval-centric systems for agents often score at the trajectory level (final outcome and intermediate steps as orthogonal dimensions), while prompt-centric systems for agents tend to version the system prompt or the policy file.
Is this just a question of UI, or does it really change behavior? It changes behavior. Teams follow the easiest path the platform offers. If the easy path is "edit and deploy," prompts ship faster than gates can catch issues. If the easy path is "score against the catalogue and promote if green," gates win. The architecture biases the workflow.