How do you use LLM-as-judge for model A/B testing and selection?

Updated: 2026-04-13 By: Ari Heljakka

Short answer

This is about choosing between models for a specific task, not benchmarking "the whole model" across every kind of thing it might do. Public leaderboards answer "which model is generally smarter"; they almost never answer "which model is better for my chatbot, my extraction pipeline, my support agent." Scope the question to your task and your inputs, or the verdict will not transfer.

Within that scope, an LLM-as-judge scores each candidate's outputs against an explicit rubric and returns a structured per-item verdict. There are two comparison modes. Scored comparison runs each candidate independently against the rubric and assigns each a 0 to 1 score per dimension. It is the most flexible to maintain over time: adding a third candidate, swapping in a new checkpoint, or re-running last quarter's comparison is just another scoring pass against the same fixed rubric, with nothing to re-pair. Pairwise comparison judges two candidates head-to-head on each item. It is useful when the differences are subtle enough that relative judgement is sharper than absolute scoring, at the cost of re-running comparisons as candidates change. Default to scored comparison and reach for pairwise when you need the extra discrimination.

Either way, the aggregate verdict is only as trustworthy as the rubric, the judge's calibration against human labels, and the dataset the comparison runs on. Treat the judge as a managed, versioned component with its own agreement metric, and the rubric as a versioned artifact independent of any candidate model. The verdict then becomes a reproducible regression decision, not a one-off vibe check.

Key facts

  • Definition: Using an LLM-as-judge for model A/B testing and selection is the practice of using one LLM to score outputs from candidate models against a written rubric (scored per-candidate, or pairwise head-to-head), producing structured verdicts that aggregate into a per-dimension comparison for a specific task.
  • When to use: Selecting between model versions for a task, comparing prompt variants on the same model, regression-gating a model swap, A/B testing a new fine-tune against the baseline. Not for ranking a model's general capability; that is what public benchmarks are for.
  • Limitations: Judges carry position bias, verbosity bias, and self-preference bias; without calibration against human labels and a versioned rubric, verdicts drift silently. Pairwise comparisons must also be re-run as candidates change, whereas scored comparisons need not be.
  • Example: Two candidate model versions are scored on 400 representative prompts; a neutral judge model returns per-item verdicts on accuracy, comprehensiveness, clarity, and best practices; the aggregate decides which version ships, gated on per-dimension floors.

Key takeaways

  • Scope the comparison to a task. This is task-specific A/B testing and selection, not benchmarking the whole model. A model that wins your task can lose a public leaderboard and vice versa; only the task-scoped verdict transfers to your product.
  • Model-vs-model comparison is a measurement problem. The rubric is the artifact; the candidate models are interchangeable implementations of the same objective.
  • Scored comparison is the most maintainable default: each candidate is scored independently against a fixed rubric, so a new candidate or a re-run is just another scoring pass, with nothing to re-pair. Reach for pairwise when differences are subtle enough that relative judgement is sharper, accepting that pairwise comparisons must be re-run as candidates change.
  • The judge must be a different model from both candidates. Self-preference bias is real; a candidate judging itself inflates its own scores.
  • Structured JSON verdicts (per-dimension score plus justification) are the unit of analysis. Free-text rationales without scores cannot be aggregated.
  • Judge agreement against human labels is the meta-metric. Without it, the comparison is unfalsifiable.

Definition

LLM-as-judge for model evaluation is the practice of using a neutral LLM to score outputs from two or more candidate models on the same inputs, against the same rubric, producing structured verdicts that aggregate into a per-dimension comparison. Three pieces matter:

  • The candidates. The models being compared (model A and model B, or version v3.0 and v3.1, or two prompt variants on the same model).
  • The rubric. An explicit written specification of what counts as a good answer, decomposed into orthogonal dimensions, with anchor examples per dimension.
  • The judge. A different LLM with no skin in the game, prompted to score each candidate's output against the rubric and return a structured verdict.

The rubric is the objective; the candidates are implementations; the judge is the measurement instrument. Each piece is independently versioned. The aggregate result is "the rubric, scored by judge version X on dataset version Y, says candidate B beats candidate A on dimensions 1 and 3 with these per-dimension floors."
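As a minimal sketch of what "independently versioned" can look like in practice, the record below ties one comparison to the exact rubric, judge, dataset, and candidates it used. The class name, field names, and version strings are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class ComparisonRun:
    """Identifies one comparison; every field is a pinned version, never 'latest'."""
    rubric_version: str
    judge_version: str
    dataset_version: str
    candidate_a: str
    candidate_b: str

    def tag(self) -> str:
        # Sorted-key JSON so the same run always produces the same tag.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical identifiers, for illustration only.
run = ComparisonRun(
    rubric_version="rubric-v4",
    judge_version="judge-2026-03",
    dataset_version="dataset-v7",
    candidate_a="model-v3.0",
    candidate_b="model-v3.1",
)
print(run.tag())
```

Storing this tag next to every aggregate result makes it possible to tell later whether two results are comparable at all.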

When this matters

LLM-as-judge for model evaluation is critical when at least one of these holds:

  • Model selection. Choosing between candidate models (open versus closed, frontier versus small, base versus fine-tuned) for a specific task. Public benchmarks rarely transfer; only a task-specific harness with a calibrated judge gives a verdict you can ship on.
  • Model swap regression. When a model provider ships a new checkpoint, comparing old versus new on your task is the gate that prevents silent regressions.
  • Fine-tune validation. Validating that a fine-tune actually improves on the baseline, dimension by dimension, against the same rubric.
  • Prompt A/B testing. Comparing prompt variants on the same model where the difference is too subtle for spot checks but matters in aggregate.
  • Vendor evaluation under controlled conditions. Running multiple providers against the same dataset and rubric, with judge-scored per-dimension verdicts, for a procurement decision.

For "which model is better on my workload," scored comparison (each candidate scored independently against the rubric) is the maintainable default and the easiest to extend as candidates come and go. Pairwise comparison earns its keep when the candidates are close enough that relative head-to-head judgement discriminates better than absolute scores. The platform's scored-comparison workflow (run candidates through the same evaluators and compare per-dimension scores, latency, and cost) is documented in "Find the best prompt and model".

How it works

A working model-vs-model LLM-as-judge evaluation has six stages.

Stage 1, set the comparison dataset

The dataset is a representative sample of the inputs the production system will see. Size guidance: a few hundred items is enough for a focused task; expand toward a couple of thousand as the comparison matures. Stratify across slices (common, edge, adversarial; per-surface; per-customer-segment) so per-slice verdicts are reportable, not just aggregate.

The dataset is a versioned artifact. Tag every comparison run with the dataset version it ran against; without that tag, comparisons across time are not meaningfully comparable.
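One way to build a stratified comparison set is sketched below. It assumes each logged item already carries a "slice" label; the item shape, the slice names, and the per-slice quota are illustrative assumptions, not a prescribed format.

```python
import random
from collections import defaultdict


def stratified_sample(items, per_slice, seed=0):
    """Pick up to `per_slice` items from each slice, reproducibly."""
    rng = random.Random(seed)
    by_slice = defaultdict(list)
    for item in items:
        by_slice[item["slice"]].append(item)
    sample = []
    for name in sorted(by_slice):  # sorted so the draw order is deterministic
        pool = by_slice[name]
        k = min(per_slice, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample


# Toy log of 30 prompts: 20 common, 7 edge, 3 adversarial.
logs = (
    [{"id": f"c{i}", "slice": "common"} for i in range(20)]
    + [{"id": f"e{i}", "slice": "edge"} for i in range(7)]
    + [{"id": f"a{i}", "slice": "adversarial"} for i in range(3)]
)
dataset = stratified_sample(logs, per_slice=5, seed=42)
print(len(dataset))  # 13: five common, five edge, all three adversarial
```

Fixing the seed keeps the sample reproducible, so a dataset version can be regenerated from the same logs.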

Stage 2, write the rubric and pick dimensions

Decompose what "good" means into orthogonal dimensions. For a model comparison, common dimensions are:

  • Accuracy against a known answer (where applicable).
  • Comprehensiveness (does the answer cover what was asked).
  • Clarity (is the answer readable and well-organised).
  • Best-practice adherence (for technical or domain tasks, does the answer follow the conventions of the domain).
  • Safety (where relevant).
  • Format compliance (where structured output is required).

Each dimension gets a short rubric with anchor examples. The rubric is versioned alongside the dataset; rubric changes invalidate prior aggregate scores.
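A rubric can be kept as plain data and versioned by its content, so any edit yields a new version id. The sketch below is one possible shape; the field names and the hash-based versioning scheme are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Dimension:
    name: str
    description: str
    anchor_low: str   # example of an answer that should score near 0
    anchor_high: str  # example of an answer that should score near 1


def rubric_version(dimensions):
    """Derive a short version id from the rubric text itself."""
    payload = json.dumps([asdict(d) for d in dimensions], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


rubric = [
    Dimension("accuracy", "Matches the known answer.",
              "States the wrong default port.", "States the correct default port."),
    Dimension("clarity", "Readable and well-organised.",
              "One unbroken paragraph of jargon.", "Short steps in a logical order."),
]
v1 = rubric_version(rubric)

# Changing a single anchor produces a different version id.
edited = [rubric[0], Dimension("clarity", "Readable and well-organised.",
                               "Rambling and repetitive.", "Short steps in a logical order.")]
v2 = rubric_version(edited)
print(v1 != v2)  # True
```

Because the id changes with any wording change, scores recorded under the old id are not silently mixed with scores under the new one.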

Stage 3, generate candidate outputs

Each candidate model generates an output for every dataset item under identical conditions: identical prompts, identical system instructions, and identical temperature and sampling parameters (or, for reproducibility, deterministic decoding on every candidate). Pin the candidate model versions explicitly; a provider's "latest stable" alias is not a candidate identity.
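The "identical conditions" requirement can be enforced by passing a single shared configuration object to every candidate. In the sketch below, `call_model` stands in for whatever provider client is actually used; its signature, the `GenerationConfig` fields, and the crude check for "latest" in a model id are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    system_prompt: str
    temperature: float
    top_p: float
    seed: int


def generate_all(candidates, dataset, config, call_model):
    """Run every pinned candidate over every item with one shared config."""
    outputs = {}
    for model_id in candidates:
        # Crude guard: an alias such as "latest" is not a pinned version.
        if "latest" in model_id:
            raise ValueError(f"candidate {model_id!r} is not pinned to a version")
        outputs[model_id] = {
            item["id"]: call_model(model_id, item["prompt"], config)
            for item in dataset
        }
    return outputs


def fake_call_model(model_id, prompt, config):
    # Stand-in for a real provider call; returns a deterministic string.
    return f"{model_id}|{config.temperature}|{prompt}"


config = GenerationConfig(
    system_prompt="You are a cloud-architecture assistant.",
    temperature=0.2, top_p=0.95, seed=1234,
)
items = [{"id": "q1", "prompt": "Security groups or network ACLs?"}]
outputs = generate_all(["model-v3.0", "model-v3.1"], items, config, fake_call_model)
```

Since the config is frozen and shared, no candidate can accidentally run with a different temperature or system prompt.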

Stage 4, design the judge

The judge is a neutral LLM, different from either candidate. The judge prompt:

  • Includes the rubric in full, with anchor examples per dimension.
  • Presents both candidate outputs under neutral labels, not "the new model" and "the old model"; randomise the assignment per item to defeat position bias.
  • Asks for a structured JSON verdict: per-dimension score (0 to 1) for each candidate, an overall winner (or tie), and a written justification per dimension.
  • Specifies the schema explicitly so output parsing does not introduce its own failure mode.

The judge runs against every item. Position bias is mitigated by randomising the A/B assignment per item; verbosity bias by including a per-dimension rubric anchor that penalises padding; self-preference bias by ensuring the judge is a different model from both candidates.
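The per-item randomisation and the schema check can be sketched as two small helpers. The JSON shape used here (scores keyed by presentation position, then by dimension) is a hypothetical schema, as are the function names and the two-dimension rubric subset.

```python
import json
import random

DIMENSIONS = ["accuracy", "clarity"]  # hypothetical subset of the rubric


def assign_labels(item_id, output_a, output_b, seed=0):
    """Randomly decide which candidate is shown first, reproducibly per item."""
    rng = random.Random(f"{seed}:{item_id}")
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    return {"first": first, "second": second, "swapped": swapped}


def parse_verdict(raw, swapped):
    """Validate the judge's JSON and map position-based scores back to candidates."""
    data = json.loads(raw)
    scores = {}
    for position in ("first", "second"):
        scores[position] = {}
        for dim in DIMENSIONS:
            value = data[position][dim]["score"]
            if not isinstance(value, (int, float)) or not 0 <= value <= 1:
                raise ValueError(f"bad score for {position}/{dim}: {value!r}")
            scores[position][dim] = float(value)
    # If the order was swapped, candidate A was presented second.
    a_key, b_key = ("second", "first") if swapped else ("first", "second")
    return {"A": scores[a_key], "B": scores[b_key]}


labels = assign_labels("item-17", "output from candidate A", "output from candidate B")
raw = json.dumps({
    "first": {"accuracy": {"score": 0.9, "justification": "..."},
              "clarity": {"score": 0.8, "justification": "..."}},
    "second": {"accuracy": {"score": 0.4, "justification": "..."},
               "clarity": {"score": 0.5, "justification": "..."}},
})
verdict = parse_verdict(raw, labels["swapped"])
```

Recording the `swapped` flag with each verdict also allows a later check of whether the judge favours one presentation position.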

Stage 5, calibrate against human labels

Before the judge's verdicts gate anything, calibrate. Sample a slice of the dataset (commonly 10 to 20 percent) and have humans score the same items against the same rubric. Compute per-dimension agreement between judge and humans (Cohen's Kappa, Fleiss' Kappa, or weighted accuracy depending on the scale). Set an agreement floor per dimension (commonly 0.7 to 0.85). Re-prompt or re-train the judge until it clears the floor on the calibration slice.

This is the only check that turns the judge from "opinion at scale" into "measurement." Without it, the judge's verdicts are unfalsifiable; with it, the judge becomes a managed component whose own quality can be tracked over time.
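For a single dimension, the agreement check can be sketched with Cohen's Kappa computed directly from its definition, (observed - expected) / (1 - expected). The example assumes the 0 to 1 scores have already been binned into categories (here "pass" and "fail"); the binning, the labels, and the toy data are assumptions.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two raters over categorical labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]
kappa = cohens_kappa(judge, human)
print(round(kappa, 3))  # 0.583
```

Raw agreement here is 80 percent, yet Kappa is about 0.58, below a 0.7 floor: the chance correction is why Kappa is preferred over plain accuracy when one label dominates.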

Stage 6, aggregate and gate

Aggregate per-dimension scores across the dataset. Report:

  • Per-dimension win rate of B vs A.
  • Per-dimension mean score for each candidate.
  • Per-slice breakdown (common, edge, adversarial).
  • Overall winner with per-dimension floors.

The deployment decision is multi-criterion: B ships if it wins on enough dimensions and clears the per-dimension floor on every dimension that has one. A B that wins the aggregate but loses safety is not a candidate; per-dimension floors prevent the aggregate from hiding the regression.
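The multi-criterion gate can be sketched as follows. It assumes per-item scores are already parsed into one dict per item for each candidate; the function name, the `min_dimension_wins` threshold, and the toy scores are illustrative, and a per-slice breakdown would repeat the same aggregation on each slice.

```python
from statistics import mean


def aggregate(scores_a, scores_b, floors, min_dimension_wins):
    """scores_*: one {dimension: score in [0, 1]} dict per item, in the same order."""
    report = {}
    for dim in scores_a[0]:
        a = [s[dim] for s in scores_a]
        b = [s[dim] for s in scores_b]
        report[dim] = {
            "mean_a": mean(a),
            "mean_b": mean(b),
            "b_win_rate": sum(y > x for x, y in zip(a, b)) / len(a),
        }
    wins = sum(r["mean_b"] > r["mean_a"] for r in report.values())
    # B must clear every floor, regardless of how many dimensions it wins.
    floors_ok = all(report[d]["mean_b"] >= f for d, f in floors.items())
    return report, (wins >= min_dimension_wins and floors_ok)


scores_a = [{"accuracy": 0.7, "safety": 0.9}, {"accuracy": 0.6, "safety": 0.9}]
scores_b = [{"accuracy": 0.9, "safety": 0.6}, {"accuracy": 0.8, "safety": 0.7}]

report, ship = aggregate(scores_a, scores_b, floors={"safety": 0.8}, min_dimension_wins=1)
print(ship)  # False: B wins accuracy but falls below the safety floor
```

Keeping the floor check separate from the win count is what stops a strong aggregate from hiding a regression on a single dimension.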

Example

Two candidate models for a cloud-architecture assistant: a current production model and a candidate. Both are pinned to specific versions. The comparison:

  • Dataset. 412 representative cloud-architecture prompts drawn from production logs, stratified into common (security groups versus network ACLs), edge (multi-region failover), and adversarial (deliberately ambiguous compliance scenarios).
  • Rubric. Four dimensions, each with anchor examples: accuracy, comprehensiveness, clarity, best-practice adherence (security and cost conventions).
  • Candidates generate outputs. Both run with identical system instructions, temperature 0.2, top-p 0.95, seed pinned.
  • Judge. A neutral third LLM (a different vendor and architecture from both candidates) prompted with the full rubric. A/B assignment randomised per item. Structured JSON output with per-dimension scores and justifications.
  • Calibration. 60 items (a 15 percent slice) are independently scored by three human cloud architects against the same rubric. Per-dimension Cohen's Kappa between judge and humans: accuracy 0.84, comprehensiveness 0.79, clarity 0.81, best-practice 0.76. All clear the 0.7 floor.
  • Aggregate. The candidate wins on accuracy (+8 points), comprehensiveness (+4 points), best-practice (+11 points); loses on clarity (-2 points). Per-dimension floors are set as 0.80 accuracy, 0.75 comprehensiveness, 0.70 clarity, 0.75 best-practice. The candidate clears every floor.

Decision: the candidate ships, with the clarity regression flagged as a tracked metric and a follow-up prompt-engineering pass scheduled. The dataset, rubric version, judge version, and per-dimension agreement metrics are recorded with the deployment; the same evaluation can be re-run when the next candidate appears.
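The calibration figures in this example can be checked in a few lines. The numbers are the ones quoted above; the variable names are illustrative.

```python
KAPPA_FLOOR = 0.7

# Per-dimension judge-vs-human agreement from the example.
kappa = {
    "accuracy": 0.84,
    "comprehensiveness": 0.79,
    "clarity": 0.81,
    "best_practice": 0.76,
}

below_floor = [dim for dim, k in kappa.items() if k < KAPPA_FLOOR]
calibration_share = 60 / 412  # calibration items over dataset size

print(below_floor)                  # []
print(round(calibration_share, 3))  # 0.146
```

An empty list means the judge clears the agreement floor on every dimension, and 60 of 412 items is roughly the 15 percent slice described.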

Limitations

  • Position bias. Judges over-prefer the first or second option presented depending on the model. Mitigate by randomising A/B assignment per item.
  • Verbosity bias. Judges over-prefer longer, more confident-sounding answers regardless of quality. Mitigate by including a per-dimension rubric anchor that penalises padding and by having human raters cross-check on a calibration slice.
  • Self-preference bias. A candidate judging itself or judging an output from the same family inflates its own scores. Use a judge from a different model family than either candidate.
  • Style preference bias. Judges trained on certain formatting conventions over-prefer outputs in that format even when the content is equivalent. Sample-check by rendering outputs in multiple formats during calibration.
  • Judge drift. Provider updates to the judge model change its scoring distribution. Recalibrate against the human-labeled slice whenever the judge model changes; track per-dimension agreement over time as the meta-metric.
  • Calibration cost. The human-labeled calibration slice is real work. Plan for it as a recurring operational cost, not a one-time tuning phase.
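  • Pairwise scales quadratically with N candidates: full all-pairs comparison needs N(N-1)/2 pairings. For three or more candidates, prefer pointwise scoring against the rubric, or limit pairwise to a reduced schedule (for example, each candidate against one fixed baseline), rather than full all-pairs.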
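The growth in judge calls can be made concrete with a small count. The sketch assumes one judge call per candidate per item for scored comparison and one call per pair per item for pairwise; the "versus one baseline" schedule is included only for contrast and is an assumption, not a recommendation from the text.

```python
from math import comb


def judge_calls(num_candidates, num_items):
    """Judge calls needed under each comparison schedule."""
    return {
        "scored": num_candidates * num_items,
        "pairwise_all_pairs": comb(num_candidates, 2) * num_items,
        "pairwise_vs_baseline": (num_candidates - 1) * num_items,
    }


print(judge_calls(3, 400))
# {'scored': 1200, 'pairwise_all_pairs': 1200, 'pairwise_vs_baseline': 800}
print(judge_calls(10, 400))
# {'scored': 4000, 'pairwise_all_pairs': 18000, 'pairwise_vs_baseline': 3600}
```

At three candidates the schedules cost about the same; at ten, all-pairs needs 4.5 times as many judge calls as scored comparison, and the gap keeps widening.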

Evidence and sources

Numeric figures in this post (Kappa thresholds, dataset sizes, sampling rates) are illustrative; calibrate against your own workload and judge before using them.

FAQ

Scored (pointwise) or pairwise? Default to scored: each candidate is scored independently against the rubric, which makes the comparison the easiest to maintain over time (a new candidate or a re-run is just another scoring pass against the same fixed rubric, with nothing to re-pair). It is also what regression gating against a fixed bar requires. Use pairwise when the candidates are close enough that relative head-to-head judgement is sharper than absolute scores, accepting that pairwise comparisons re-run as the candidate set changes. The scored-comparison workflow is documented in "Find the best prompt and model".

Is this for benchmarking the whole model? No. This is task-specific A/B testing and selection: "which candidate is better for my task, on my inputs, against my rubric." Benchmarking a model's general capability across many tasks is what public leaderboards are for, and those verdicts rarely transfer to a specific product. Scope the comparison to the task you actually ship.

Can the judge be the same model as a candidate? No. Self-preference bias inflates the candidate's scores. Use a judge from a different model family.

How big should the comparison dataset be? Start with the smallest stratified set that gives stable per-dimension verdicts (often a few hundred items). Expand as the comparison matures and as edge slices become more important.

What if the judge disagrees with the humans on calibration? Re-prompt or replace the judge until agreement clears the per-dimension floor. If no judge clears the floor, the rubric is probably ambiguous; tighten it.

How often should I recalibrate the judge? Whenever the judge model version changes, whenever the rubric changes, and on a fixed cadence (monthly or quarterly) against a refreshed human-labeled slice. Treat judge drift as a first-class operational signal.

Does this work for prompt comparisons too? Yes. The same pipeline scores prompt variant A versus prompt variant B on the same model, with the prompt as the implementation and the rubric as the objective.
