Updated: 2026-04-26 By: Ari Heljakka
Short answer
Noise in agentic LLM evals is the run-to-run variance that hides whether a change really helped. The fix is not a bigger eval set; it is to (1) separate prediction noise from data noise, (2) score model variants on the same items using pairwise comparisons, and (3) report confidence intervals from bootstrapping or paired tests instead of single point scores. Without these, "+2 points on the benchmark" usually means nothing.
Key facts
- Definition: Noise in an agentic eval is any variance in the reported score that is not caused by the underlying capability of the system under test. It splits into prediction noise (the same prompt scored differently across runs or judges) and data noise (the benchmark items themselves vary in difficulty, labels, or framing).
- When to use: Any time a model change, prompt change, tool change, or judge change is being decided on the basis of a numeric eval delta, especially for agentic systems where multi-step rollouts amplify variance.
- Limitations: Noise reduction does not fix a benchmark that measures the wrong thing. It also cannot rescue an eval with too few items per slice, or a judge whose rubric is fundamentally miscalibrated against human labels.
- Example: A 70% vs 72% headline difference between two agent configurations vanishes once paired bootstrap confidence intervals are computed across the same 200 tasks; the "improvement" was within run-to-run variance.
Key takeaways
- Agentic evals are noisier than single-turn evals because each rollout compounds randomness from sampling, tool calls, and judge variance.
- Most reported eval gains are not statistically distinguishable from noise; pairing variants on the same items and reporting confidence intervals fixes this.
- Inter-rater reliability between LLM judges and human labels (Cohen's kappa, Krippendorff's alpha) is the calibration signal that tells you whether a judge is worth trusting.
- Variance reduction is a methodology, not a tool: it requires explicit decisions about sample size, pairing, seeds, and judges, not a new dashboard.
Definition
Noise in agentic LLM evals is the portion of measured variance that does not correspond to a real capability difference between systems. Two sources dominate:
- Prediction noise is variance from the model under test and from the judge: temperature, sampling, tool-call non-determinism, and the LLM-as-judge's own run-to-run variance on the same input.
- Data noise is variance from the benchmark itself: item-level ambiguity, label disagreement among human annotators, distributional skew across slices, and contamination.
A trustworthy eval result separates these two and reports uncertainty for each. A point score with no interval is, in practice, an opinion.
When this matters
Noise control becomes critical in any of the following situations:
- Model selection. Choosing between two frontier models on a benchmark where the headline delta is single-digit.
- Prompt or policy changes. Deciding whether a new system prompt, planning policy, or tool schema actually improved performance.
- Judge upgrades. Migrating an LLM-as-judge to a newer judge model, where calibration on the old rubric is not guaranteed.
- Agentic rollouts. Multi-step agents amplify variance: every step adds sampling noise, tool latency, and partial failures that propagate into the final score.
- Regression gating. Any CI gate that blocks a deploy on a numeric eval threshold; without intervals, the gate trips on noise.
If the same eval is rerun three times and produces three meaningfully different scores, the eval is not yet measuring the system; it is measuring the run.
How it works
A rigorous noise-aware eval has four moving parts: a careful benchmark, paired comparisons, repeated runs, and statistical reporting.
Separate prediction noise from data noise
Run the same configuration multiple times across the same items and decompose the total variance:
- Within-item variance across repeated rollouts of the same configuration estimates prediction noise.
- Between-item variance of the per-item mean scores, holding configuration fixed, estimates data noise.
This decomposition tells you whether to invest in more seeds (prediction noise dominant) or more items (data noise dominant).
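The decomposition above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `decompose_variance` and the input layout (a mapping from item id to a list of repeated scores) are assumptions made for the example, and it assumes every item has the same number of repeats.

```python
from statistics import mean, variance


def decompose_variance(scores_by_item):
    """Split score variance into prediction noise and data noise.

    scores_by_item maps an item id to the scores from repeated rollouts
    of ONE fixed configuration (hypothetical layout; needs at least two
    items and at least two repeats per item).
    """
    runs = [list(r) for r in scores_by_item.values()]
    repeats = len(runs[0])
    if any(len(r) != repeats for r in runs):
        raise ValueError("this sketch assumes equal repeats per item")

    # Prediction noise: average sample variance across repeats of an item.
    within = mean(variance(r) for r in runs)

    # The variance of per-item means still contains within / repeats of
    # prediction noise, so subtract that share and clip at zero.
    item_means = [mean(r) for r in runs]
    between = max(0.0, variance(item_means) - within / repeats)

    return {"prediction_noise": within, "data_noise": between}
```

If `prediction_noise` is the larger number, extra seeds are the better investment; if `data_noise` is larger, extra items are.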
Use pairwise comparisons on the same items
Comparing two systems on the same item set, with the same seeds where applicable, removes most of the data noise from the comparison. Pairwise design is dramatically more powerful than comparing two independent score averages.
- For LLM-as-judge: ask the judge to compare A vs B head-to-head on each item rather than scoring each in isolation.
- For numeric scorers: subtract the per-item scores and analyze the distribution of differences, not the marginal means.
A paired test on n items is roughly equivalent in power to an unpaired test on several times as many items, depending on the correlation between A and B per item.
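To see where the extra power comes from, compare the standard error of the mean difference computed both ways on the same data. The sketch below is illustrative only; `paired_vs_unpaired_se` is a hypothetical helper, and it assumes one numeric score per item for each system, with items in the same order.

```python
from math import sqrt
from statistics import mean, stdev


def paired_vs_unpaired_se(a_scores, b_scores):
    """Standard error of (B - A), paired and unpaired, on the same items."""
    if len(a_scores) != len(b_scores):
        raise ValueError("pairing requires the same items for A and B")
    n = len(a_scores)

    # Paired: analyze the per-item differences directly.
    diffs = [b - a for a, b in zip(a_scores, b_scores)]
    se_paired = stdev(diffs) / sqrt(n)

    # Unpaired: treat the two score lists as independent samples.
    se_unpaired = sqrt(stdev(a_scores) ** 2 / n + stdev(b_scores) ** 2 / n)

    return {"delta": mean(diffs), "se_paired": se_paired, "se_unpaired": se_unpaired}
```

Under the simplifying assumption that A and B have equal per-item variance and per-item correlation ρ, the paired standard error is smaller by a factor of sqrt(1 − ρ), so n paired items carry roughly the precision of n / (1 − ρ) items per arm in an unpaired design.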
Account for prediction noise with repeated rollouts
Single-run eval results conflate the system's mean behavior with a single random draw. Repeat each configuration on each item with multiple seeds (or, where seeds are not exposed, multiple samples at the same nonzero temperature) and aggregate. The number of repeats needed scales with the per-item variance you measured in the first step.
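One way to turn that scaling into a number is sketched below, under stated assumptions: items are treated as a random sample, rollouts are independent, and the variance of the mean score is data_noise / n_items + prediction_noise / (n_items × repeats). Both function names are hypothetical.

```python
from math import ceil, sqrt


def standard_error(data_noise, prediction_noise, n_items, repeats):
    """Standard error of the mean score over n_items with `repeats` rollouts each."""
    return sqrt(data_noise / n_items + prediction_noise / (n_items * repeats))


def repeats_needed(data_noise, prediction_noise, n_items, target_se):
    """Smallest number of repeats per item that reaches target_se.

    Returns None when the target is unreachable by adding repeats alone,
    because the data-noise term data_noise / n_items already exceeds it.
    """
    budget = target_se ** 2 - data_noise / n_items
    if budget <= 0:
        return None  # more items are needed, not more seeds
    return max(1, ceil(prediction_noise / (n_items * budget)))
```

A `None` result is the signal that repeats have stopped helping and the item count is the binding constraint.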
Report uncertainty, not point scores
Three tools cover most cases:
- Bootstrapping over items (and over repeats) produces a confidence interval for the mean score and for the A vs B delta.
- Paired tests (paired bootstrap, sign test, Wilcoxon signed-rank) give a p-value or interval for the per-item difference. When A and B scores are positively correlated per item, these are much tighter than unpaired comparisons.
- Inter-rater reliability statistics (Cohen's kappa, Krippendorff's alpha) quantify how well an LLM judge agrees with human labels. When agreement falls below your working bar (roughly 0.6 is a common one), the judge score is uncalibrated and downstream confidence intervals are misleading.
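The paired bootstrap from the list above can be sketched with the standard library alone. This is a minimal percentile-bootstrap illustration, not a definitive implementation: `paired_bootstrap_ci` and its parameters are hypothetical, and it assumes each list holds one score per item (for example, the mean over repeated rollouts), in the same item order for A and B.

```python
import random


def paired_bootstrap_ci(a_scores, b_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(B - A), resampling items."""
    if len(a_scores) != len(b_scores) or not a_scores:
        raise ValueError("need the same non-empty item set for A and B")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible

    # Work on per-item differences so pairing is preserved in every resample.
    diffs = [b - a for a, b in zip(a_scores, b_scores)]
    n = len(diffs)

    # Resample items with replacement and record the mean difference.
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))

    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(diffs) / n, (boot[lo_idx], boot[hi_idx])
```

If the returned interval contains zero, the A vs B delta is not distinguishable from noise at the chosen level. Resampling whole items, rather than individual rollouts, keeps repeated rollouts of one item from being counted as independent evidence.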
Calibrate the judge before you trust the score
LLM-as-judge variance is its own subspecies of prediction noise. Calibration steps that pay back:
- Score a labeled gold set with the judge, compute kappa or alpha against human labels.
- Run the judge multiple times on the same items to estimate its self-consistency.
- Prefer pairwise judging over absolute scoring for subjective rubrics; pairwise rubrics tend to have higher inter-rater agreement.
- Update the judge prompt or model only when the new version passes a calibration regression on the gold set.
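The first calibration step, agreement between judge and human labels on a gold set, can be sketched as follows. This is plain unweighted Cohen's kappa written out by hand for illustration; `cohens_kappa` is a hypothetical helper, and for ordinal scores such as 1 to 5 a weighted variant is usually the better fit.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

The same function can be pointed at two runs of the judge on the same items to get a rough self-consistency number.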
Example
A team has two agent configurations A and B, a 200-item benchmark, and an LLM-as-judge that scores 1 to 5.
- Single run, naive comparison: A scores 3.40, B scores 3.52. Headline: "B is 3.5% better."
- Repeated rollouts: Each configuration is run 5 times per item with different seeds. A's mean is 3.41 with within-item standard deviation 0.42; B's mean is 3.50 with within-item standard deviation 0.38. Prediction noise is large relative to the apparent gap.
- Pairwise judging: The judge is rerun in head-to-head mode, ranking A vs B on each item across the 5 paired rollouts. B wins 51% of pairings, A wins 44%, ties 5%.
- Paired bootstrap: 10,000 bootstrap resamples over items yield a 95% confidence interval for B minus A of [-0.02, +0.18]. The interval crosses zero. The "improvement" is not statistically distinguishable from noise.
- Action: Either collect more items (for example, in any slice where B looked stronger), or accept that A and B are tied on this benchmark and decide on cost, latency, or other criteria instead.
The same numerical workflow applies to tool-use accuracy, faithfulness, instruction-following, or any agentic metric whose value is reported as a mean.
Limitations
Noise reduction is not free, and it does not solve every problem an eval can have:
- It cannot fix a wrong rubric. A perfectly calibrated, low-variance score against the wrong criterion is worse than a noisy score against the right one.
- Sample size has a floor. Per-slice intervals require enough items in the slice. Slicing an eval into 12 categories of 15 items each produces 12 wide intervals; resist over-slicing.
- Pairing is not always possible. If A and B use different tools, different routing, or different rollout structures, per-item pairing breaks down and unpaired comparisons (with their wider intervals) are the honest fallback.
- Judge drift is ongoing. A calibrated judge today may be miscalibrated next quarter as the judge model is updated. Calibration is a recurring task, not a one-off.
- Cost compounds. Repeated rollouts multiply inference cost by the number of seeds; bootstrapping is cheap, repeats are not. Budget accordingly.
Evidence and sources
Primary source
- Dr. Sida Wang, "Measuring all the noises of agentic LLM evals," AI Agent Frontier (YouTube), 2026-03-23. Video: https://www.youtube.com/watch?v=AT4zQLVX7_g.
FAQ
Why are agentic evals noisier than single-turn evals? Each step in an agentic rollout introduces independent randomness: model sampling, tool-call timing, retrieval results, partial failures, and the judge's own variance on the final trace. Variance compounds across steps, so even modest per-step noise becomes large at the rollout level.
Is increasing the eval set size always the right answer? No. If prediction noise dominates, more items barely help; what you need is more seeds per item. If data noise dominates, more items help, but only if they are drawn from the right distribution. Decompose the variance first, then decide.
What is a "good enough" inter-rater agreement for an LLM judge? There is no universal threshold, but Cohen's kappa or Krippendorff's alpha above roughly 0.6 against expert human labels on a calibration set is a common working bar. Below that, judge scores should be treated as directional, not authoritative, and pairwise judging often recovers some reliability.
Are pairwise comparisons always better than absolute scoring? For subjective rubrics, pairwise comparisons usually have higher inter-rater agreement and tighter paired intervals. For objective rubrics (exact match, schema validity), absolute scoring is fine. Mixing both is common: absolute for objective criteria, pairwise for everything else.
Can I skip statistical tests if my eval delta is large? If you genuinely see a 30-point gap with non-overlapping intervals across hundreds of items, formal testing adds little. For the more common 1 to 5 point claimed deltas, paired tests with confidence intervals are the difference between a real result and a noisy one.
What is the cheapest first step to reduce eval noise? Rerun your headline eval three times with different seeds and look at the spread of scores. If the spread is comparable to the differences you have been reporting, every previous "improvement" is suspect, and the rest of this workflow becomes the obvious investment.
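That first check fits in a few lines. The sketch below is illustrative only: `spread_check` is a hypothetical helper, the scores in the usage example are made-up inputs, and comparing the min-to-max spread against the claimed delta is a rough heuristic, not a statistical test.

```python
def spread_check(run_scores, claimed_delta):
    """Compare the spread of repeated headline scores with a claimed delta."""
    if len(run_scores) < 2:
        raise ValueError("need at least two reruns to measure spread")
    spread = max(run_scores) - min(run_scores)
    return {
        "spread": spread,
        # Suspect when the run-to-run spread is at least as large as the claim.
        "suspect": spread >= abs(claimed_delta),
    }


# Three reruns of the same configuration (hypothetical scores, in points).
result = spread_check([70.0, 72.5, 71.0], claimed_delta=2.0)
print(result)
```

A `suspect` result does not prove the earlier gains were noise; it only says the paired, interval-based workflow is needed before trusting them.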