Updated: 2026-03-01 By: Ari Heljakka
Short answer
A composite evaluation score collapses many dimensions of LLM quality (accuracy, faithfulness, refusal correctness, safety, latency, cost) into a single number. The number is useful as a release gate or trend indicator; it is dangerous as a quality verdict because it hides the per-dimension regressions that drove the change. The defensible practice is to normalize every dimension to 0 to 1, weight each by operational priority, expose the per-dimension breakdown alongside the headline, version both the weights and the underlying judges, and never report the composite without surfacing what moved beneath it. Read this way, a composite score is a compression of a tradeoff frontier into one number; read any other way, it is theater.
Key facts
- Definition: A scalar score derived from multiple per-dimension scores, used as a release gate or aggregate quality indicator.
- When to use: Anywhere multiple dimensions need to be tracked but a single number is needed for gating, dashboards, or stakeholder reporting.
- Limitations: Hides per-dimension regressions; sensitive to weight choices; only meaningful when underlying scores are pinned to versioned judges and rubrics.
- Example: A composite of about 0.84 can mean "0.85 faithfulness, 0.78 helpfulness, 0.90 safety" or "0.99 faithfulness, 0.60 helpfulness, 0.95 safety"; under an equal-weight average both score nearly the same, but only one is shippable.
Key takeaways
- Always normalize per-dimension scores to 0 to 1; mixing scales breaks weighting.
- Weights encode operational priorities; document and version them as artifacts.
- Composite scores need the per-dimension breakdown to be interpretable.
- Use minimum-based composition for safety-critical dimensions (any single failure fails the composite).
- Pinned judges are non-negotiable; a drifting judge invalidates every historical comparison.
Definition
A composite LLM evaluation score is a scalar derived by combining per-dimension scores through a fixed rule. The rule can be weighted average, weighted geometric mean, minimum, weighted minimum, or a more complex function. The output is a single number; the rule, the weights, and the per-dimension scores are the artifacts behind it.
Three properties decide whether a composite score is informative:
- Normalization. Every input score is on the same 0 to 1 scale so the weights are meaningful.
- Independence. Dimensions are orthogonal; overlap means a property is double-counted.
- Provenance. The composite carries the underlying scores, weights, and judge versions; the headline is interpretable only with that context.
The composite is a compression of information; the information it loses is the tradeoff geometry. Composite scores are useful for release gates and trend monitoring; they are inadequate for diagnosis.
When this matters
- Release gates. CI/CD pipelines need a single pass/fail to act on, and a well-constructed composite is what fills that slot.
- Dashboards and stakeholder reports. Leadership audiences track one number over time, and the right composite moves predictably enough to be that number.
- Trend monitoring. Plotting the composite over time surfaces drift; per-dimension drift then explains the trend.
- Cross-prompt comparison. Different prompts on the same evaluation framework get comparable composites.
- Vendor or model comparison. A composite normalizes across stacks the way per-dimension scores do not.
How it works
Building a composite score that holds up beyond its first use requires five construction decisions. Each decision is a versioned artifact.
Decision 1: Pick orthogonal dimensions
Dimensions should be independent. Faithfulness and accuracy often overlap on grounded tasks; tone and helpfulness often overlap on conversational tasks. Audit by computing the per-example correlation between each pair of dimensions; when a pair moves together more than 80 percent of the time, either collapse it into one dimension or keep only the more interpretable of the two.
A working set for many production prompts: faithfulness, helpfulness, refusal correctness, safety, latency, cost. Five to seven dimensions is typical; below four loses information, above eight inflates measurement cost.
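The overlap audit can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it reads "move together more than 80 percent of the time" as a Pearson correlation above 0.8, which is an assumption, and the function names and sample scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of per-example scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant dimension carries no correlation signal
    return cov / sqrt(var_x * var_y)


def overlapping_pairs(scores, threshold=0.8):
    """Return (dim_a, dim_b, r) for every pair whose |r| exceeds the threshold."""
    flagged = []
    for a, b in combinations(sorted(scores), 2):
        r = pearson(scores[a], scores[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical per-example scores, one list per dimension.
example_scores = {
    "faithfulness": [0.9, 0.8, 0.4, 0.7, 0.95],
    "accuracy": [0.85, 0.8, 0.35, 0.75, 0.9],
    "helpfulness": [0.6, 0.9, 0.7, 0.5, 0.8],
}
print(overlapping_pairs(example_scores))
```

On this made-up sample only the faithfulness/accuracy pair is flagged, which is the kind of pair the audit is meant to surface.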
Decision 2: Normalize each dimension to 0 to 1
Different scorers naturally output different ranges. A token counter is unbounded; a Likert score runs 1 to 5; a percentage runs 0 to 100. Normalize all of them to 0 to 1, with 1 meaning "perfectly meets the bar":
- Latency: 1 at or below the SLO floor, 0 at or above the SLO ceiling, linear in between.
- Cost: 1 at or below the budget, 0 at or above twice the budget, linear in between.
- Likert 1 to 5: subtract 1 and divide by 4, so that 1 maps to 0 and 5 maps to 1.
- Pass-rate metrics: already 0 to 1.
The normalization curve is a versioned artifact; changing it changes the composite without any underlying model change.
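A minimal sketch of these normalization curves, assuming simple linear interpolation with clamping. The helper names and the sample thresholds are illustrative; the Likert helper uses min-max scaling so that the lowest rating maps to 0 and the highest to 1.

```python
def clamp01(x):
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def linear_down(value, best, worst):
    """1.0 at or below `best`, 0.0 at or above `worst`, linear in between.

    Intended for lower-is-better quantities such as latency or cost.
    """
    return clamp01((worst - value) / (worst - best))


def normalize_likert(rating, low=1, high=5):
    """Min-max scale a Likert rating so `low` maps to 0 and `high` maps to 1."""
    return clamp01((rating - low) / (high - low))


def normalize_percentage(pct):
    """Map a 0 to 100 percentage onto 0 to 1."""
    return clamp01(pct / 100.0)


# Latency with a hypothetical SLO floor of 800 ms and ceiling of 2000 ms.
print(linear_down(1400, best=800, worst=2000))
# Cost with a hypothetical budget of 400 tokens, reaching 0 at twice the budget.
print(linear_down(600, best=400, worst=2 * 400))
print(normalize_likert(3))
```

Because the curve parameters (`best`, `worst`, `low`, `high`) determine the score, they belong in the same versioned artifact as the weights.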
Decision 3: Pick a composition rule
Three common rules; pick by the semantics you want.
- Weighted arithmetic mean. Standard choice when dimensions are substitutable. A high score on one dimension compensates for a lower score on another. Useful for general quality composites.
- Weighted geometric mean. Penalizes low scores more aggressively (the product collapses toward zero when any factor is near zero). Useful when partial failures are real failures.
- Weighted minimum. Returns the lowest weighted score. Useful when one dimension is non-negotiable (safety): any failure fails the composite.
A hybrid is common in production: weighted geometric mean on the soft dimensions, weighted minimum on the hard floors (safety, refusal correctness). The composite is "the geometric mean, capped at the worst hard-floor score."
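The rules can be compared side by side in a short sketch. Two assumptions are worth flagging: weights are renormalized over whichever dimensions are passed in, and the hard-floor cap uses a plain (unweighted) minimum, since the exact form of the weighted minimum is not specified here. Function names and sample scores are hypothetical.

```python
import math


def weighted_arithmetic_mean(scores, weights):
    """Substitutable dimensions: a high score offsets a low one."""
    total = sum(weights.values())
    return sum(w * scores[d] for d, w in weights.items()) / total


def weighted_geometric_mean(scores, weights):
    """Collapses toward zero when any weighted dimension is near zero."""
    active = {d: w for d, w in weights.items() if w > 0}
    if any(scores[d] <= 0.0 for d in active):
        return 0.0
    total = sum(active.values())
    log_sum = sum(w * math.log(scores[d]) for d, w in active.items())
    return math.exp(log_sum / total)


def hybrid_composite(scores, soft_weights, hard_dims):
    """Geometric mean of the soft dimensions, capped at the worst hard-floor score.

    Assumption: the cap is an unweighted minimum over `hard_dims`.
    """
    soft = weighted_geometric_mean(scores, soft_weights)
    worst_hard = min(scores[d] for d in hard_dims)
    return min(soft, worst_hard)


scores = {"faithfulness": 0.9, "helpfulness": 0.4, "safety": 0.95}
soft_weights = {"faithfulness": 0.5, "helpfulness": 0.5}

print(weighted_arithmetic_mean(scores, soft_weights))
print(weighted_geometric_mean(scores, soft_weights))
print(hybrid_composite(scores, soft_weights, hard_dims=["safety"]))
```

With one weak dimension, the geometric mean sits below the arithmetic mean, and lowering the safety score below the soft result would pull the hybrid composite down with it.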
Decision 4: Pick the weights and document them
Weights encode operational priority; they are not technical decisions, they are product decisions. A useful discipline:
- Weights sum to 1.0; explicit so the reader can recompute.
- Weights are versioned with a rationale (one sentence on why each weight has the value it does).
- Hard floors (safety, refusal correctness) carry threshold metadata in addition to weights: "this dimension scores 0 for any release whose raw score falls below 0.95, regardless of the other dimensions."
Weights are not silently revisable. A change to the weights is a release-gate change that goes through the same review as a prompt change.
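One way to make this discipline hard to skip is to encode the weights as a validated, versioned object. The sketch below is illustrative only: the class names, the version string, and the rationale sentences are hypothetical, while the weights and floors reuse the values from the Example section.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DimensionSpec:
    weight: float
    rationale: str                      # one sentence on why the weight has this value
    hard_floor: Optional[float] = None  # release is blocked below this score


@dataclass(frozen=True)
class WeightSet:
    version: str
    dimensions: Dict[str, DimensionSpec]

    def __post_init__(self):
        # Weights must sum to 1.0 so a reader can recompute the composite.
        total = sum(spec.weight for spec in self.dimensions.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"weights sum to {total}, expected 1.0")
        missing = [n for n, s in self.dimensions.items() if not s.rationale.strip()]
        if missing:
            raise ValueError(f"missing rationale for: {missing}")

    def floor_violations(self, scores: Dict[str, float]) -> List[str]:
        """Names of hard-floor dimensions whose score is below the floor."""
        return [n for n, s in self.dimensions.items()
                if s.hard_floor is not None and scores[n] < s.hard_floor]


# Hypothetical rationales; weights and floors follow the Example section.
WEIGHTS_V1 = WeightSet(version="weights-v1", dimensions={
    "faithfulness": DimensionSpec(0.30, "Ungrounded claims are the costliest failure.", 0.85),
    "helpfulness": DimensionSpec(0.25, "Primary driver of user satisfaction."),
    "refusal_correctness": DimensionSpec(0.15, "Wrong refusals block legitimate work.", 0.95),
    "latency": DimensionSpec(0.15, "Interactive use needs fast responses."),
    "cost": DimensionSpec(0.15, "Token spend scales with traffic."),
})
```

A weight change then becomes a new `WeightSet` with a new version string, which is easy to route through the same review as a prompt change.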
Decision 5: Pin the underlying judges
The composite is only as stable as its per-dimension scorers. Pin every judge model and judge prompt as a versioned artifact. Track judge agreement against human labels; promote a judge to the composite only when its agreement clears a threshold (a Matthews correlation coefficient of at least 0.6 for binary judgments, a rank correlation of at least 0.7 for graded ones).
A drifting judge silently rewrites the composite. The single most common source of "the composite is fine but the system is worse" is judge drift on a dimension nobody is tracking independently.
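For the binary case, the promotion check can be sketched with a hand-rolled Matthews correlation coefficient. The function names and the sample labels are hypothetical; the 0.6 threshold is the one stated above.

```python
from math import sqrt


def matthews_corrcoef(human, judge):
    """Matthews correlation between binary human labels and binary judge verdicts."""
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom


def judge_is_promotable(human, judge, threshold=0.6):
    """True when judge agreement with human labels clears the threshold."""
    return matthews_corrcoef(human, judge) >= threshold


# Hypothetical calibration set: 1 = pass, 0 = fail.
human_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
judge_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(round(matthews_corrcoef(human_labels, judge_labels), 3))
print(judge_is_promotable(human_labels, judge_labels))
```

Rerunning the same check on a fixed calibration set whenever the judge model or judge prompt changes is one way to catch drift before it reaches the composite.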
Example
A team scores a research-assistant agent. The composite combines five dimensions:
- Faithfulness, weight 0.30, hard floor 0.85
- Helpfulness, weight 0.25
- Refusal correctness, weight 0.15, hard floor 0.95
- Latency, weight 0.15 (normalized: 1 at 800 ms p95, 0 at 2000 ms p95)
- Cost, weight 0.15 (normalized: 1 at 400 tokens, 0 at 1200 tokens)
Composition rule: weighted geometric mean on faithfulness, helpfulness, latency, cost; weighted minimum on the hard floors.
Two candidate prompts:
| Dimension | Candidate A | Candidate B |
|---|---|---|
| Faithfulness | 0.88 | 0.96 |
| Helpfulness | 0.85 | 0.72 |
| Refusal correctness | 0.97 | 0.96 |
| Latency (norm.) | 0.72 | 0.55 |
| Cost (norm.) | 0.74 | 0.62 |
| Composite | 0.83 | 0.78 |
A scores higher on the composite, but the per-dimension breakdown surfaces the tradeoff: B is more faithful but less helpful, slower, and more expensive. For a high-stakes use case where faithfulness matters more than helpfulness, the team raises faithfulness's weight to 0.45 (recomputing both composites: A becomes 0.82, B becomes 0.81). The reweighting is recorded as a versioned change with a one-sentence rationale.
A regulator asking "why did the team accept candidate A in week 6?" receives a specific answer: the composite, the per-dimension scores, the weights and their rationale, the judge versions, and the prior reweighting decision. The composite by itself does not answer the question; the artifact set does.
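The composition rule for the two candidates can be sketched as follows. The exact renormalization is not spelled out in this example, so the sketch assumes the soft weights are renormalized over the four geometric-mean dimensions and that the result is capped at the worst hard-floor score; under those assumptions the values land near, but not exactly on, the composites in the table. Names such as `evaluate` are hypothetical.

```python
import math

# Scores copied from the candidate table; weights and floors from the dimension list.
SCORES = {
    "A": {"faithfulness": 0.88, "helpfulness": 0.85, "refusal_correctness": 0.97,
          "latency": 0.72, "cost": 0.74},
    "B": {"faithfulness": 0.96, "helpfulness": 0.72, "refusal_correctness": 0.96,
          "latency": 0.55, "cost": 0.62},
}
SOFT_WEIGHTS = {"faithfulness": 0.30, "helpfulness": 0.25, "latency": 0.15, "cost": 0.15}
HARD_FLOORS = {"faithfulness": 0.85, "refusal_correctness": 0.95}


def soft_geometric_mean(scores, weights):
    """Weighted geometric mean; assumption: weights are renormalized over this subset."""
    total = sum(weights.values())
    return math.exp(sum(w * math.log(scores[d]) for d, w in weights.items()) / total)


def evaluate(scores):
    """Return the composite together with the breakdown needed to interpret it."""
    violations = [d for d, floor in HARD_FLOORS.items() if scores[d] < floor]
    soft = soft_geometric_mean(scores, SOFT_WEIGHTS)
    capped = min(soft, min(scores[d] for d in HARD_FLOORS))
    composite = 0.0 if violations else capped  # any floor violation fails the gate
    return {"composite": round(composite, 3),
            "floor_violations": violations,
            "per_dimension": dict(scores)}


for name, scores in SCORES.items():
    result = evaluate(scores)
    print(name, result["composite"], result["floor_violations"])
```

Returning the per-dimension scores and floor violations alongside the composite keeps the headline number attached to the context it needs.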
Limitations
- The composite is only as good as the dimensions. Missing a dimension that should be measured (because nobody built a scorer for it) means the composite cannot move when that dimension regresses.
- Weight choices encode opinions. Two stakeholders disagreeing on weights are disagreeing about product priorities, not about evaluation methodology. The composite makes the disagreement explicit; it does not resolve it.
- Geometric and minimum composition can be too punishing. A single dimension at 0 collapses the geometric mean and the minimum to 0. Set floor thresholds carefully; not every low score deserves a release block.
- Composite trends can mask compensating regressions. Faithfulness drops by 0.05 while helpfulness rises by 0.05; if the two carry equal weight, the composite is flat while the system is worse. Always plot per-dimension trends alongside the composite.
- Cross-team composites need versioned weights. Different teams using the same composite framework with different weights cannot compare numbers; the weights must be part of the comparison.
- Composite scoring tempts over-aggregation. A single number for the whole agent is convenient and almost always misleading. Compute composites per workflow or per use case; system-wide composites are usually not actionable.
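The compensating-regression case is easy to check mechanically. The sketch below is a hypothetical guard, not an established tool: it compares two releases dimension by dimension and flags any drop beyond a tolerance, regardless of what the composite did. The tolerance of 0.02 and the sample scores are assumptions.

```python
def dimension_deltas(previous, current):
    """Per-dimension change between two releases (current minus previous)."""
    return {d: round(current[d] - previous[d], 4) for d in previous}


def regressions(previous, current, tolerance=0.02):
    """Dimensions that dropped by more than `tolerance`, even if the composite is flat."""
    return {d: delta for d, delta in dimension_deltas(previous, current).items()
            if delta < -tolerance}


def equal_weight_mean(scores):
    return sum(scores.values()) / len(scores)


# Hypothetical releases: faithfulness falls by 0.05 while helpfulness rises by 0.05.
release_1 = {"faithfulness": 0.90, "helpfulness": 0.80}
release_2 = {"faithfulness": 0.85, "helpfulness": 0.85}

print(equal_weight_mean(release_1), equal_weight_mean(release_2))
print(regressions(release_1, release_2))
```

Here the equal-weight composite does not move, yet the faithfulness drop is still reported.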
Evidence and sources
- HELM: Holistic Evaluation of Language Models. crfm.stanford.edu/helm
- Pareto-optimality in multi-objective optimization (Wikipedia). en.wikipedia.org/wiki/Pareto_efficiency
- BIG-bench: a collaborative benchmark for LLM capabilities. github.com/google/BIG-bench
FAQ
Should I report the composite or the per-dimension scores? Both. The composite is the gate; the per-dimension scores are the diagnosis. Never one without the other.
What composition rule should I default to? Weighted geometric mean with hard floors on safety-critical dimensions. The geometric mean penalizes low scores more than the arithmetic mean, which matches the intuition that a partial failure is closer to a full failure than to a partial success.
How do I set the weights? By product priority, not by gut feel. Start with equal weights; adjust when stakeholders disagree on a release; record the adjustment as a versioned event with a rationale.
Can I have a different composite per use case? Yes, and you usually should. A summarization agent and a refusal-heavy compliance agent need different weight structures.
What if a judge changes between releases? Stop and recalibrate. A change in a judge invalidates every comparison made with it. Either pin the judge version or rerun every historical sample against the new judge before comparing.
How does this relate to Pareto frontiers? A composite is a scalar projection of a frontier. The frontier is the diagnostic; the composite is the gate. Use the frontier to choose the weights; use the composite to ship.
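To make the frontier concrete, the sketch below filters a set of candidates down to the non-dominated ones, assuming every dimension is already normalized so that higher is better. Candidates A and B reuse the scores from the Example section; candidate C and the function names are hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every dimension and better on one."""
    return all(a[d] >= b[d] for d in a) and any(a[d] > b[d] for d in a)


def pareto_front(candidates):
    """Names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        dominated = any(dominates(other, scores)
                        for other_name, other in candidates.items()
                        if other_name != name)
        if not dominated:
            front.append(name)
    return sorted(front)


candidates = {
    "A": {"faithfulness": 0.88, "helpfulness": 0.85, "refusal_correctness": 0.97,
          "latency": 0.72, "cost": 0.74},
    "B": {"faithfulness": 0.96, "helpfulness": 0.72, "refusal_correctness": 0.96,
          "latency": 0.55, "cost": 0.62},
    # Hypothetical candidate that A beats on every dimension.
    "C": {"faithfulness": 0.80, "helpfulness": 0.80, "refusal_correctness": 0.95,
          "latency": 0.70, "cost": 0.70},
}

print(pareto_front(candidates))
```

A and B both stay on the frontier because each wins somewhere; choosing between them is exactly the decision the weights encode.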