Updated: 2026-04-12 By: Ari Heljakka
Short answer
Treat LLM evaluation the way you treat unit tests: as first-class CI infrastructure that gates every change to prompts, models, datasets, and tools. The gate is layered (deterministic checks, a managed LLM judge, sampled human review) and each layer is independently scored against a versioned ground-truth dataset. For a metric judge the gate has a simple form: accept only if score > threshold (per dimension, against a floor the team agreed on). The catch is that an LLM judge's score is not deterministic, so a single run can land on either side of the threshold by chance. To make the gate stable you must account for that variance: run enough repetitions in parallel and gate on the aggregate (mean, or a lower confidence bound) rather than a single sample, so a pass or fail reflects the model's real quality and not scoring noise. Pull requests run a fast subset; merges run the full suite; production rollouts are guarded by canary metrics that roll traffic back to the prior version automatically when a quality dimension drifts below its floor. LLMs do not break, they drift, and the only way to catch drift is a gate that keeps running after deployment.
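As a rough illustration of gating on an aggregate rather than a single sample, the sketch below computes a one-sided lower confidence bound on the mean of repeated judge scores and compares it to the threshold. The function names, the five example scores, and the z value of 1.645 (about 95 percent one-sided, under a normal approximation) are assumptions for illustration; with very few repetitions a t-based bound would be more conservative.

```python
import math
import statistics

def lower_confidence_bound(scores, z=1.645):
    """One-sided lower bound on the mean score (normal approximation)."""
    if len(scores) < 2:
        raise ValueError("need at least two repetitions to estimate variance")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * stderr

def gate_passes(scores, threshold, z=1.645):
    """Accept only if the lower bound, not a single sample, clears the threshold."""
    return lower_confidence_bound(scores, z) > threshold

# Hypothetical judge scores for one dimension over five repetitions.
runs = [0.74, 0.69, 0.77, 0.72, 0.75]
print(round(lower_confidence_bound(runs), 3))
print(gate_passes(runs, threshold=0.70))
```

In this example the second repetition alone (0.69) would have failed a 0.70 threshold, while the bound over all five runs clears it. That is the kind of single-run noise the aggregate is meant to absorb.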
Key facts
- Definition: A CI/CD evaluation gate is a versioned, automated check that scores an LLM application against measurable objectives and blocks promotion when any dimension regresses.
- When to use: Any prompt, model, RAG pipeline, agent, or tool change shipped to production users. Any model upgrade. Any change to the ground-truth dataset itself.
- Limitations: Evaluation is only as good as the dataset. Static datasets go stale. LLM judges drift when the judge's own model changes. A gate that never fails is not a gate.
- Example: A pull request that edits a prompt triggers a fast suite (deterministic checks plus a small managed judge run) in under three minutes. Merge to main triggers the full ground-truth pass with per-dimension scores. Canary deployment routes one percent of traffic, scores it live, and rolls back automatically if any dimension drops below its floor.
Key takeaways
- Evaluation is not a phase; it is continuous infrastructure that runs at PR time, merge time, deploy time, and in production.
- The gate must be layered: deterministic checks, managed LLM judge, sampled human review. No single layer is trustworthy alone.
- Scores are normalized to 0 to 1 per dimension. Composition is a weighted rule the team agreed on, recorded in the rubric.
- Datasets, rubrics, prompts, and judges are all versioned artifacts. Reproducibility requires pinning every input to a release.
- Quality drift is a first-class operational signal with its own alerts and runbooks, not a quarterly review.
Definition
A CI/CD evaluation gate is the automated checkpoint that decides whether a change to an LLM application is allowed to proceed: into a branch, into main, into a canary, or into full production traffic. It has the same shape as a unit test gate, with three differences:
- Outputs are non-deterministic. The same input can produce different outputs run to run, so the gate scores distributions, not single values.
- The gate is itself a model. A managed LLM judge scores subjective dimensions. The judge has to be calibrated against ground truth and re-calibrated when its own model changes.
- Failure is graded, not binary. A unit test passes or fails. An eval gate scores each dimension on a 0 to 1 scale and applies a per-dimension floor and a composite rule.
The evaluation suite, the ground-truth dataset, and the judge rubric are versioned alongside the application code. A change to any of them is a change that requires its own pull request and its own review.
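One possible way to express graded failure in code is a rubric that carries a floor and a weight per dimension, with the gate reporting which floors were missed alongside the weighted composite. The dimension names, floors, weights, and composite floor below are hypothetical; a real rubric would record whatever the team agreed on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dimension:
    floor: float   # minimum acceptable score, 0 to 1
    weight: float  # weight in the composite

def evaluate_gate(scores, rubric, composite_floor):
    """Return (passed, failures, composite) for one scored revision."""
    failures = [name for name, dim in rubric.items() if scores[name] < dim.floor]
    total_weight = sum(dim.weight for dim in rubric.values())
    composite = sum(scores[n] * d.weight for n, d in rubric.items()) / total_weight
    if composite < composite_floor:
        failures.append("composite")
    return (not failures, failures, composite)

# Hypothetical rubric and scores.
rubric = {
    "faithfulness": Dimension(floor=0.80, weight=0.50),
    "tone": Dimension(floor=0.70, weight=0.25),
    "brevity": Dimension(floor=0.70, weight=0.25),
}
scores = {"faithfulness": 0.92, "tone": 0.81, "brevity": 0.66}
passed, failures, composite = evaluate_gate(scores, rubric, composite_floor=0.75)
print(passed, failures, round(composite, 4))
```

Here the composite clears its floor comfortably while brevity misses its own, so the revision is still blocked. Checking floors independently of the composite is the design choice that makes this visible.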
When this matters
- Any prompt, model, or RAG change. The change set most LLM teams ship is not Python code; it is prompts, retrieval indices, and model versions. Each of those needs the same gate.
- Model upgrades. A model swap can silently improve some dimensions and silently regress others. Without a gate, the regression ships invisibly.
- Multi-author teams. When several engineers ship prompt changes a week, the gate is the only thing keeping the system coherent.
- Regulated or high-stakes use cases. Medical, legal, financial, and safety surfaces need an evidence trail showing that every shipped version cleared the same bar.
- Long-lived products. Drift compounds. A product that ran clean at launch can be quietly miscalibrated six months later without a gate that keeps running in production.
How it works
A serious CI/CD evaluation stack has four loops, each running at a different cadence on a different slice of the change set.
Loop 1: Pull request, fast suite
Fires on every PR that touches prompts, models, datasets, or evaluator config. Optimized for speed and signal density, not coverage.
- Deterministic checks. Schema validation, regex over outputs, length caps, required tokens, disallowed tokens, language detection. Runs in seconds.
- Small managed judge run. A subset of the ground-truth dataset (the boundary slice, deliberately biased toward known failure modes) is scored by the managed judge. Each rubric dimension is reported with its delta from main.
- Per-dimension floor check. Any dimension that drops below its floor fails the PR. The PR comment includes the failing examples.
Target latency: under three minutes. Anything slower and engineers route around it.
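The deterministic layer of the fast suite might look something like the following sketch, which assumes the application emits a JSON object with `answer` and `sources` keys. That schema, the 600-character cap, and the disallowed phrases are all hypothetical placeholders, not part of any particular product.

```python
import json
import re

MAX_CHARS = 600                        # hypothetical length cap
REQUIRED_KEYS = {"answer", "sources"}  # hypothetical output schema
DISALLOWED = re.compile(r"\b(as an ai|i cannot help)\b", re.IGNORECASE)

def deterministic_checks(raw_output):
    """Return the names of failed checks; an empty list means pass."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(payload, dict):
        return ["schema"]
    failed = []
    if not REQUIRED_KEYS.issubset(payload):
        failed.append("schema")
    answer = str(payload.get("answer", ""))
    if len(answer) > MAX_CHARS:
        failed.append("length_cap")
    if DISALLOWED.search(answer):
        failed.append("disallowed_tokens")
    return failed

good = '{"answer": "Reset it in Settings.", "sources": ["kb-12"]}'
bad = '{"answer": "As an AI, I cannot help."}'
print(deterministic_checks(good))
print(deterministic_checks(bad))
```

Checks like these need no model call, which is why they can run in seconds and sit in front of the judge run.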
Loop 2: Merge to main, full suite
Fires on merge or pre-merge. Optimized for coverage.
- Full ground-truth pass. The complete dataset is scored on every rubric dimension.
- Composite gate. Per-dimension floors plus the composite rule. Failure blocks the merge.
- Judge agreement check. The judge's own agreement with the labeled set is re-measured. If agreement has drifted, the judge is recalibrated before the suite is trusted.
- Artifact publication. Scored outputs, judge versions, dataset versions, and prompt versions are stored as a reproducible release bundle.
Target latency: ten to thirty minutes. Acceptable because it runs once per merge, not per push.
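The judge agreement check can be sketched as a comparison of judge verdicts against human labels on the same examples. This version assumes agreement means the plain fraction of matching pass/fail verdicts, which is one of several reasonable definitions (a chance-corrected statistic is another). The function names and the default floor of 0.85 are illustrative.

```python
def judge_agreement(judge_verdicts, human_labels):
    """Fraction of labeled examples where the judge matches the human label."""
    if len(judge_verdicts) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty, equal-length label lists")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)

def judge_is_trusted(judge_verdicts, human_labels, agreement_floor=0.85):
    """The suite result is only trusted while agreement holds its floor."""
    return judge_agreement(judge_verdicts, human_labels) >= agreement_floor

# Hypothetical verdicts on ten labeled examples (True = pass).
human = [True, True, False, True, False, True, True, False, True, True]
judge = [True, True, False, True, True, True, True, False, True, True]
print(judge_agreement(judge, human), judge_is_trusted(judge, human))
```

When `judge_is_trusted` returns False, the sensible reaction under this scheme is to recalibrate the judge before reading anything into the suite's scores.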
The "complete dataset" can be small, especially at the start. A judge run against even 10 examples is better than shipping on vibes, and with a strong judge, 20 diverse, well-chosen cases is already meaningful coverage. When labeled data is scarce, compensate with breadth of dimensions rather than volume of examples: a handful of dimensions that generalize (grounding, instruction-following, format compliance, safety) catch whole classes of failure that no realistic number of per-instance examples would. A judge specialized for grounding, for instance, flags product hallucinations across the board without needing one labeled example per hallucination type. Grow the dataset over time from sampled human review; do not let "we do not have enough data yet" be the reason there is no gate at all.
Loop 3: Canary deployment, live scoring
Fires on production rollout. A small fraction of real traffic (one to five percent) is routed to the new version. Outputs are scored live by the managed judge against the same rubric.
- Live per-dimension monitoring. Each dimension is a tracked metric with its own alert threshold.
- Automatic rollback. If any dimension drops below its floor for more than a defined window, traffic is rolled back to the prior version. The rollback is automatic; the post-mortem is human.
- Sampled human review. A small fraction of canary outputs (deliberately biased toward the boundary slice) is queued for human review. Labels feed back into the ground-truth dataset.
Target latency: continuous, with alerts on the order of minutes.
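The rollback rule (a dimension below its floor for longer than a defined window) could be sketched as a small rolling-window monitor. The class name is hypothetical, and each observation is assumed to be an already-aggregated score for one interval, for example a per-minute mean of live judge scores.

```python
from collections import deque

class RollbackMonitor:
    """Signal rollback when a dimension stays below its floor for a full window."""

    def __init__(self, floor, window_size):
        self.floor = floor
        self.recent = deque(maxlen=window_size)

    def observe(self, score):
        """Record one aggregated score; return True if rollback should fire."""
        self.recent.append(score)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(s < self.floor for s in self.recent)

# Hypothetical per-interval tone scores from a canary, floor 0.70, window of 3.
monitor = RollbackMonitor(floor=0.70, window_size=3)
decisions = [monitor.observe(s) for s in [0.74, 0.68, 0.66, 0.69, 0.72]]
print(decisions)  # [False, False, False, True, False]
```

Requiring the whole window to sit below the floor keeps a single noisy interval from triggering a rollback, at the cost of reacting one window later.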
Loop 4: Production drift, slow watch
Always running. The same managed judge that gates PRs and canaries also scores a sample of full production traffic.
- Drift dashboards. Per-dimension score distributions plotted over time, with annotations for every model, prompt, and dataset change.
- Anomaly alerts. Sudden shifts (any dimension moving more than the historical noise floor) fire to the on-call rotation.
- Dataset refresh. Production traces sampled at the boundary are continuously fed into human review and into the ground-truth dataset. The dataset is itself versioned.
This loop catches what the others cannot: drift driven by the live input distribution, by an upstream model change you did not initiate, or by a slow-burn change in user behavior.
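A minimal sketch of the anomaly alert, assuming the historical noise floor is defined as k standard deviations of past per-period mean scores for a dimension. Both that definition and the default k of 3 are assumptions; a team would substitute its own measured noise floor.

```python
import statistics

def drift_alert(history, current, k=3.0):
    """True if `current` sits more than k historical standard deviations
    away from the historical mean for this dimension."""
    center = statistics.fmean(history)
    noise_floor = k * statistics.stdev(history)
    return abs(current - center) > noise_floor

# Hypothetical daily mean scores for one dimension over the past week.
history = [0.82, 0.81, 0.83, 0.82, 0.80, 0.82, 0.81]
print(drift_alert(history, current=0.80))
print(drift_alert(history, current=0.77))
```

With this history, a day at 0.80 stays inside the band and a day at 0.77 falls outside it. The check is symmetric, so a sudden jump upward also fires, which can be worth a look when it coincides with an upstream model change.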
Versioning every input to the gate
Reproducibility requires pinning everything:
- Prompt version. Semver, stored in git.
- Model identifier and pin. Provider, model name, exact checkpoint, temperature, seed.
- Dataset version. Hash of the ground-truth file, plus a changelog row per added or removed example.
- Judge version. The judge prompt has a semver of its own, and the judge's model is pinned the same way the application model is.
- Rubric version. Adding a dimension is a minor bump; reweighting the composite is a minor bump; removing a dimension is a major bump.
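A gate that cannot be reproduced cannot be trusted. A release bundle that includes all five pins can be rerun with identical inputs on a different machine to confirm a result; because model outputs are non-deterministic, expect the rerun's scores to agree within the measured noise floor rather than to match exactly.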
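One way to make the five pins concrete is a small immutable record with a stable fingerprint. The field names and example values below are hypothetical, and the model pin string only stands in for whatever provider, checkpoint, temperature, and seed the application actually uses.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def dataset_hash(raw_bytes):
    """SHA-256 of the ground-truth file contents."""
    return hashlib.sha256(raw_bytes).hexdigest()

@dataclass(frozen=True)
class ReleaseBundle:
    prompt_version: str
    model_pin: str
    dataset_sha256: str
    judge_version: str
    rubric_version: str

    def fingerprint(self):
        """Stable hash over all five pins, usable as a release identifier."""
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

# Hypothetical pins for one release.
bundle = ReleaseBundle(
    prompt_version="1.4.0",
    model_pin="example-provider/example-model@checkpoint-a;temperature=0;seed=7",
    dataset_sha256=dataset_hash(b'{"id": 1, "input": "...", "label": "pass"}\n'),
    judge_version="2.1.0",
    rubric_version="1.2.0",
)
print(bundle.fingerprint()[:12])
```

Serializing with sorted keys keeps the fingerprint independent of field order, so two machines that hold the same pins derive the same identifier.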
Example
A team shipping a customer-support assistant runs the four loops over a typical week.
- Monday. An engineer opens a PR that adds a new constraint to the system prompt. The PR suite runs in 2 minutes 40 seconds: 28 deterministic checks pass, and the judge scores the 60-example boundary slice. Faithfulness goes up by 0.04, tone holds, and brevity drops by 0.06, falling below its floor of 0.70. The PR is blocked with a comment listing the three examples that regressed.
- Tuesday. The engineer pushes a follow-up commit constraining the new behavior to a single sentence. Suite reruns. Brevity recovers; all dimensions hold or improve. PR is approved and merged. The merge suite runs the full 600-example pass in 18 minutes. Judge agreement with the labeled set is 0.88, above the 0.85 floor.
- Wednesday. The new version rolls out as a one-percent canary. Live judge scoring runs against canary outputs. Tone drifts in production traffic that did not appear in the dataset (informal customer messages the dataset under-represents). The dimension crosses its floor for ten minutes; the system rolls back automatically.
- Thursday. A sampled batch of the failing canary outputs is reviewed by a domain expert and added to the ground-truth dataset (dataset version bumped). The prompt revision is reworked against the refreshed dataset.
- Friday. The reworked revision passes the full suite and the canary. Promotion is automatic. The release bundle (prompt hash, model pin, dataset hash, judge hash, rubric hash) is stored alongside the deployed image.
The shape that repeats is: every change is gated, every gate runs the same rubric, every regression has a reproducible bundle, and the dataset grows from every cycle.
Limitations
- Static datasets go stale. The single biggest reason an eval gate quietly stops working is a dataset that no longer reflects live traffic. Refresh from sampled human review on a cadence, and treat the dataset version as a release artifact.
- The judge is itself a model. A managed LLM judge has its own drift, its own model version, and its own calibration burden. Recalibrate every time the judge's model changes, and track judge-versus-human agreement as a first-class metric.
- Latency creeps into the PR suite. A suite that takes 15 minutes to run on a PR will be skipped. Keep the PR suite fast (under three minutes) by running only the boundary slice and pushing the full pass to merge time.
- Composite scores can hide regressions. A composite that rises while one dimension falls can ship a worse product. Per-dimension floors are mandatory; a passing composite must never override a failed floor.
- A gate that never fails is not a gate. If no PR has ever been blocked, the gate is too loose, the dataset is too easy, or both. Track block rate as a meta-metric.
- Sampling matters. Production drift loops that score uniformly random traffic miss the slice that matters. Bias the sample toward the boundary (judge score near the threshold) and toward newly observed input patterns.
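The boundary-biased sampling described in the last item can be sketched with weighted random selection, where a trace's weight peaks when its judge score sits at the threshold. The weighting function, its bandwidth, and the trace dictionaries are assumptions for illustration, and the sketch does not attempt to detect newly observed input patterns. Note that `random.choices` samples with replacement, so a real review queue would likely deduplicate.

```python
import random

def boundary_weight(score, threshold, bandwidth=0.1, base=0.05):
    """Weight peaks at the threshold and decays linearly to `base`
    once the score is `bandwidth` or more away from it."""
    closeness = max(0.0, 1.0 - abs(score - threshold) / bandwidth)
    return base + (1.0 - base) * closeness

def sample_for_review(traces, threshold, k, seed=0):
    """Draw k traces (with replacement), favoring scores near the threshold."""
    rng = random.Random(seed)
    weights = [boundary_weight(t["score"], threshold) for t in traces]
    return rng.choices(traces, weights=weights, k=k)

# Hypothetical production traces with judge scores for one dimension.
traces = [
    {"id": "a", "score": 0.95},
    {"id": "b", "score": 0.71},
    {"id": "c", "score": 0.68},
    {"id": "d", "score": 0.30},
]
picked = sample_for_review(traces, threshold=0.70, k=5)
print([t["id"] for t in picked])
```

The small `base` weight keeps far-from-threshold traces in the pool at a low rate, so the review queue still sees occasional confident passes and confident failures.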
Evidence and sources
- GitHub Actions documentation, https://docs.github.com/en/actions, for the standard CI scaffolding that wraps any evaluation suite.
- OpenAI Evals cookbook, https://cookbook.openai.com/, for layered evaluation patterns (deterministic plus judge plus human) and dataset construction guidance.
- Anthropic model evaluation documentation, https://docs.anthropic.com/en/docs/test-and-evaluate/, for rubric-driven judge construction and inter-rater calibration patterns.
Numeric figures sometimes quoted in CI/CD evaluation write-ups (specific p-values, drift percentages, judge agreement scores) are typically reported without enough methodological detail to reproduce. Anchor your gate on your own ground-truth dataset and your own measured noise floor.
FAQ
What is the minimum viable eval gate for a small team? Far less than people assume. A deterministic check suite plus a managed LLM judge, wired into the PR template, with per-dimension 0 to 1 scoring, a per-dimension floor, and a composite rule. The dataset can be tiny: any judge run against even 10 examples is better than shipping on vibes, and if the judge is strong, 20 diverse, well-chosen test cases is already a lot. The trick when data is scarce is to compensate with breadth of dimensions rather than volume of examples. Add several evaluation dimensions that generalize (grounding, instruction-following, format compliance, safety) instead of trying to collect a data point for every specific failure. A judge specialized for grounding checks catches a whole class of product hallucinations without needing one labeled example per hallucination type, because the dimension generalizes where individual data points do not. Start there; everything else (more examples, canary scoring, drift dashboards, dataset refresh loops) is incremental.
How often should the ground-truth dataset be refreshed? Continuously. Sampled human review feeds new examples in every week. Tag every refresh as a dataset version so a regression can be attributed to a dataset change or to a prompt change separately.
Should the same judge run at PR time and in production? Yes, ideally. Running the same judge end-to-end is what makes the PR signal predictive of the production signal. If the production judge differs from the PR judge, you are gating on the wrong thing.
How do I prevent the gate from blocking legitimate improvements? Per-dimension floors and a composite rule, both negotiated and documented. The floor is the score the dimension currently holds in production; a revision either holds the floor or moves it up. Updating a floor is its own pull request.
What happens when the underlying model is upgraded? Treat it the same way you treat a prompt change. Run the full suite. Expect some dimensions to move; ratify the new floors explicitly in a pull request rather than silently letting the new model define the baseline.
How do I keep the suite fast enough to actually run on every PR? Stratify the dataset. The PR suite runs a small, deliberately hard subset (the boundary slice). The merge suite runs the full set. The production loop runs continuously on a small fraction of live traffic. No single loop has to do everything.
Can the gate be bypassed in an emergency? Yes, with a documented override that names a human owner and records the decision. A gate with no override path will be circumvented; a gate with an audit trail will not.