How do you interpret a composite LLM evaluation score?

Updated: 2026-03-01 By: Ari Heljakka

Short answer

A composite evaluation score collapses many dimensions of LLM quality (accuracy, faithfulness, refusal correctness, safety, latency, cost) into a single number. The number is useful as a release gate or trend indicator; it is dangerous as a quality verdict because it hides the per-dimension regressions that drove the change. The defensible practice is to normalize every dimension to 0 to 1, weight each by operational priority, expose the per-dimension breakdown alongside the headline, version both the weights and the underlying judges, and never report the composite without surfacing what moved beneath it. Read this way, a composite score is a compression of a tradeoff frontier into one number; read any other way, it is theater.

Key facts

  • Definition: A scalar score derived from multiple per-dimension scores, used as a release gate or aggregate quality indicator.
  • When to use: Anywhere multiple dimensions need to be tracked but a single number is needed for gating, dashboards, or stakeholder reporting.
  • Limitations: Hides per-dimension regressions; sensitive to weight choices; only meaningful when underlying scores are pinned to versioned judges and rubrics.
  • Example: Under an equal-weight average, a composite of about 0.84 can mean "0.85 faithfulness, 0.78 helpfulness, 0.90 safety" or "0.99 faithfulness, 0.60 helpfulness, 0.95 safety"; the headline is nearly identical, but only one is shippable.

Key takeaways

  • Always normalize per-dimension scores to 0 to 1; mixing scales breaks weighting.
  • Weights encode operational priorities; document and version them as artifacts.
  • Composite scores need the per-dimension breakdown to be interpretable.
  • Use minimum-based composition for safety-critical dimensions (any single failure fails the composite).
  • Pinned judges are non-negotiable; a drifting judge invalidates every historical comparison.

Definition

A composite LLM evaluation score is a scalar derived by combining per-dimension scores through a fixed rule. The rule can be weighted average, weighted geometric mean, minimum, weighted minimum, or a more complex function. The output is a single number; the rule, the weights, and the per-dimension scores are the artifacts behind it.

Three properties decide whether a composite score is informative:

  • Normalization. Every input score is on the same 0 to 1 scale so the weights are meaningful.
  • Independence. Dimensions are orthogonal; overlap means a property is double-counted.
  • Provenance. The composite carries the underlying scores, weights, and judge versions; the headline is interpretable only with that context.

The composite is a compression of information; the information it loses is the tradeoff geometry. Composite scores are useful for release gates and trend monitoring; they are inadequate for diagnosis.

When this matters

  • Release gates. CI/CD pipelines need a single pass/fail to act on, and a well-constructed composite is what fills that slot.
  • Dashboards and stakeholder reports. Leadership audiences track one number over time, and the right composite moves predictably enough to be that number.
  • Trend monitoring. Plotting the composite over time surfaces drift; per-dimension drift then explains the trend.
  • Cross-prompt comparison. Different prompts on the same evaluation framework get comparable composites.
  • Vendor or model comparison. A composite normalizes across stacks the way per-dimension scores do not.

How it works

Building a composite score that survives beyond its first use requires five design decisions. Each decision is a versioned artifact.

Decision 1: Pick orthogonal dimensions

Dimensions should be independent. Faithfulness and accuracy often overlap on grounded tasks; tone and helpfulness often overlap on conversational tasks. Audit by computing per-example correlation. For pairs that move together more than 80 percent of the time, either collapse them into one dimension or keep only the more interpretable one.

A working set for many production prompts: faithfulness, helpfulness, refusal correctness, safety, latency, cost. Five to seven dimensions is typical; below four loses information, above eight inflates measurement cost.

Decision 2: Normalize each dimension to 0 to 1

Different scorers naturally output different ranges. A token counter is unbounded; a Likert score runs 1 to 5; a percentage runs 0 to 100. Normalize all of them to 0 to 1, with 1 meaning "perfectly meets the bar":

  • Latency: 1 at or below the SLO floor, 0 at or above the SLO ceiling, linear between.
  • Cost: 1 at or below the budget floor, 0 at or above twice the budget, linear between.
  • Likert 1 to 5: subtract 1 and divide by 4, so that 1 maps to 0 and 5 maps to 1.
  • Pass-rate metrics: already 0 to 1.

The normalization curve is a versioned artifact; changing it changes the composite without any underlying model change.
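A minimal sketch of these normalization curves follows. The function names are hypothetical, the Likert mapping uses (rating - 1) / 4 so that 1 maps to 0 and 5 maps to 1, and the latency and cost bounds are example values rather than recommendations.

```python
def clamp01(x):
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def normalize_linear(value, best, worst):
    """Lower-is-better metric: 1.0 at `best`, 0.0 at `worst`, linear between."""
    return clamp01((worst - value) / (worst - best))


def normalize_likert(rating, low=1, high=5):
    """Map a Likert rating so `low` becomes 0.0 and `high` becomes 1.0."""
    return clamp01((rating - low) / (high - low))


def normalize_percent(pct):
    """Map a 0 to 100 percentage onto 0 to 1."""
    return clamp01(pct / 100.0)


# Assumed bounds: latency SLO floor 800 ms and ceiling 2000 ms;
# cost budget 400 tokens, scoring 0 at twice the budget (800 tokens).
latency_score = normalize_linear(1400, best=800, worst=2000)
cost_score = normalize_linear(600, best=400, worst=800)
likert_score = normalize_likert(4)
print(latency_score, cost_score, likert_score)
```

Because the bounds are arguments rather than constants, changing the curve is a visible change to a versioned call site instead of a silent edit inside the scorer.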

Decision 3: Pick a composition rule

Three common rules; pick by the semantics you want.

  • Weighted arithmetic mean. Standard choice when dimensions are substitutable. A high score on one dimension compensates for a lower score on another. Useful for general quality composites.
  • Weighted geometric mean. Penalizes low scores more aggressively (the product collapses toward zero when any factor is near zero). Useful when partial failures are real failures.
  • Weighted minimum. Returns the lowest weighted score. Useful when one dimension is non-negotiable (safety): any failure fails the composite.

A hybrid is common in production: weighted geometric mean on the soft dimensions, weighted minimum on the hard floors (safety, refusal correctness). The composite is "the geometric mean, capped at the worst hard-floor score."
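The rules can be compared side by side with a short sketch. It rests on stated assumptions: the function names are hypothetical, weights are renormalized inside each rule, and because the text does not define how weights enter the minimum, the hard-floor cap uses a plain unweighted minimum. The two score profiles are illustrative.

```python
import math


def weighted_arithmetic_mean(scores, weights):
    """Substitutable dimensions: a high score offsets a low one."""
    total = sum(weights.values())
    return sum(w * scores[d] for d, w in weights.items()) / total


def weighted_geometric_mean(scores, weights):
    """Low scores drag the result down; any zero score yields zero."""
    if any(scores[d] <= 0.0 for d in weights):
        return 0.0
    total = sum(weights.values())
    log_sum = sum(w * math.log(scores[d]) for d, w in weights.items())
    return math.exp(log_sum / total)


def hybrid(scores, soft_weights, hard_dims):
    """Geometric mean of the soft dimensions, capped at the worst hard-floor score."""
    soft = weighted_geometric_mean(scores, soft_weights)
    return min(soft, min(scores[d] for d in hard_dims))


balanced = {"faithfulness": 0.85, "helpfulness": 0.78, "safety": 0.90}
lopsided = {"faithfulness": 0.99, "helpfulness": 0.60, "safety": 0.95}
equal = {dim: 1.0 for dim in balanced}  # equal weights, renormalized by each rule

for name, profile in (("balanced", balanced), ("lopsided", lopsided)):
    print(
        name,
        round(weighted_arithmetic_mean(profile, equal), 3),
        round(weighted_geometric_mean(profile, equal), 3),
        min(profile.values()),
    )
```

Under equal weights the two arithmetic means land within about 0.004 of each other, while the geometric mean and the minimum both rank the lopsided profile lower. That difference is the practical reason to choose the rule by the semantics you want.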

Decision 4: Pick the weights and document them

Weights encode operational priority; they are not technical decisions, they are product decisions. A useful discipline:

  • Weights sum to 1.0; explicit so the reader can recompute.
  • Weights are versioned with a rationale (one sentence on why each weight has the value it does).
  • Hard floors (safety, refusal correctness) carry threshold metadata in addition to weights: "this dimension scores 0 for any release in which it falls below 0.95, regardless of the other dimensions."

Weights are not silently revisable. A change to the weights is a release-gate change that goes through the same review as a prompt change.
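One way to make this discipline hard to skip is to encode it in the weight artifact itself. The sketch below is an assumption about how such an artifact might look, not an established schema: the class names, the version label, and the placeholder rationale strings are invented, while the numeric weights and floors mirror this article's worked example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DimensionWeight:
    name: str
    weight: float
    rationale: str                      # one sentence on why the weight has this value
    hard_floor: Optional[float] = None  # below this, the dimension scores 0


@dataclass(frozen=True)
class WeightSet:
    version: str
    dimensions: tuple

    def __post_init__(self):
        # Reject weight sets the reader could not recompute or audit.
        total = sum(d.weight for d in self.dimensions)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"weights sum to {total}, expected 1.0")
        missing = [d.name for d in self.dimensions if not d.rationale.strip()]
        if missing:
            raise ValueError(f"missing rationale for: {missing}")

    def apply_floors(self, scores):
        """Return a copy of scores with below-floor dimensions set to 0."""
        floored = dict(scores)
        for d in self.dimensions:
            if d.hard_floor is not None and scores[d.name] < d.hard_floor:
                floored[d.name] = 0.0
        return floored


weights_v1 = WeightSet(
    version="weights-v1",  # hypothetical version label
    dimensions=(
        DimensionWeight("faithfulness", 0.30, "Placeholder rationale.", hard_floor=0.85),
        DimensionWeight("helpfulness", 0.25, "Placeholder rationale."),
        DimensionWeight("refusal_correctness", 0.15, "Placeholder rationale.", hard_floor=0.95),
        DimensionWeight("latency", 0.15, "Placeholder rationale."),
        DimensionWeight("cost", 0.15, "Placeholder rationale."),
    ),
)
floored = weights_v1.apply_floors(
    {"faithfulness": 0.88, "helpfulness": 0.85, "refusal_correctness": 0.94,
     "latency": 0.72, "cost": 0.74}
)
print(floored)
```

Making the classes frozen means a reweighting has to be a new `WeightSet` with a new version, which fits the rule that weight changes go through review.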

Decision 5: Pin the underlying judges

The composite is only stable to the extent that every per-dimension scorer is stable. Pin every judge model and judge prompt as a versioned artifact. Track judge agreement against human labels; promote a judge to the composite only when its agreement clears a threshold (a Matthews correlation coefficient of at least 0.6 for binary labels, a rank correlation of at least 0.7 for graded labels).

A drifting judge silently rewrites the composite. The single most common source of "the composite is fine but the system is worse" is judge drift on a dimension nobody is tracking independently.

Example

A team scores a research-assistant agent. The composite combines five dimensions:

  • Faithfulness, weight 0.30, hard floor 0.85
  • Helpfulness, weight 0.25
  • Refusal correctness, weight 0.15, hard floor 0.95
  • Latency, weight 0.15 (normalized: 1 at 800 ms p95, 0 at 2000 ms p95)
  • Cost, weight 0.15 (normalized: 1 at 400 tokens, 0 at 1200 tokens)

Composition rule: weighted geometric mean on faithfulness, helpfulness, latency, cost; weighted minimum on the hard floors.

Two candidate prompts:

Dimension             Candidate A   Candidate B
Faithfulness          0.88          0.96
Helpfulness           0.85          0.72
Refusal correctness   0.97          0.96
Latency (norm.)       0.72          0.55
Cost (norm.)          0.74          0.62
Composite             0.83          0.78

A scores higher on the composite, but the per-dimension breakdown surfaces the tradeoff: B is more faithful but less helpful, slower, and costlier. For a high-stakes use case where faithfulness matters more than helpfulness, the team raises the faithfulness weight to 0.45 and recomputes both composites: A becomes 0.82, B becomes 0.81. The reweighting is recorded as a versioned change with a one-sentence rationale.

A regulator asking "why did the team accept candidate A in week 6?" receives a specific answer: the composite, the per-dimension scores, the weights and their rationale, the judge versions, and the prior reweighting decision. The composite by itself does not answer the question; the artifact set does.
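The artifact set can be captured as a single record stored next to the headline number. The sketch below is one possible shape, not a standard: the class name, the version labels, the judge identifiers, and the rationale text are hypothetical, while the scores, weights, and composite are copied from candidate A above rather than recomputed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class CompositeRecord:
    candidate: str
    composite: float
    scores: dict
    weights: dict
    weights_version: str
    judge_versions: dict
    rationale: str

    def fingerprint(self):
        """Short content hash, so any change to the artifact set is detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


record = CompositeRecord(
    candidate="A",
    composite=0.83,  # copied from the table, not recomputed here
    scores={"faithfulness": 0.88, "helpfulness": 0.85, "refusal_correctness": 0.97,
            "latency": 0.72, "cost": 0.74},
    weights={"faithfulness": 0.30, "helpfulness": 0.25, "refusal_correctness": 0.15,
             "latency": 0.15, "cost": 0.15},
    weights_version="weights-v1",                      # hypothetical label
    judge_versions={"faithfulness": "judge-faith-v3",  # hypothetical identifiers
                    "helpfulness": "judge-help-v2"},
    rationale="Placeholder: one sentence on why this candidate was accepted.",
)
print(record.fingerprint())
print(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Storing the fingerprint with the release makes it cheap to show later that the weights and judge versions behind a decision have not been altered.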

Limitations

  • The composite is only as good as the dimensions. Missing a dimension that should be measured (because nobody built a scorer for it) means the composite cannot move when that dimension regresses.
  • Weight choices encode opinions. Two stakeholders disagreeing on weights are disagreeing about product priorities, not about evaluation methodology. The composite makes the disagreement explicit; it does not resolve it.
  • Geometric and minimum composition can be too punishing. A single dimension at 0 collapses the geometric mean and the minimum to 0. Set floor thresholds carefully; not every low score deserves a release block.
  • Composite trends can mask compensating regressions. If faithfulness drops by 0.05 while helpfulness rises by 0.05, an equally weighted arithmetic composite stays flat even though the system may be worse. Always plot per-dimension trends alongside the composite.
  • Cross-team composites need versioned weights. Different teams using the same composite framework with different weights cannot compare numbers; the weights must be part of the comparison.
  • Composite scoring tempts over-aggregation. A single number for the whole agent is convenient and almost always misleading. Compute composites per workflow or per use case; system-wide composites are usually not actionable.
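The compensating-regression case is easy to guard against mechanically. The sketch below is a minimal check, assuming two score snapshots keyed by dimension; the function name, the sample numbers, and the 0.02 tolerance are hypothetical.

```python
def regressed_dimensions(previous, current, tolerance=0.02):
    """Return the score change for each dimension that dropped by more than `tolerance`."""
    return {
        dim: round(current[dim] - previous[dim], 4)
        for dim in previous
        if previous[dim] - current[dim] > tolerance
    }


def equal_weight_mean(scores):
    """Unweighted arithmetic composite, used here only to show the flat headline."""
    return sum(scores.values()) / len(scores)


# Hypothetical snapshots from two consecutive releases.
last_release = {"faithfulness": 0.90, "helpfulness": 0.80}
this_release = {"faithfulness": 0.85, "helpfulness": 0.85}

print(round(equal_weight_mean(last_release), 3), round(equal_weight_mean(this_release), 3))
print(regressed_dimensions(last_release, this_release))
```

Here the headline is unchanged between releases while the check still reports the faithfulness drop, which is the case a composite-only dashboard would miss.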

FAQ

Should I report the composite or the per-dimension scores? Both. The composite is the gate; the per-dimension scores are the diagnosis. Never one without the other.

What composition rule should I default to? Weighted geometric mean with hard floors on safety-critical dimensions. The geometric mean penalizes low scores more than the arithmetic mean, which matches the intuition that a partial failure is closer to a full failure than to a partial success.

How do I set the weights? By product priority, not by gut feel. Start with equal weights; adjust when stakeholders disagree on a release; record the adjustment as a versioned event with a rationale.

Can I have a different composite per use case? Yes, and you usually should. A summarization agent and a refusal-heavy compliance agent need different weight structures.

What if a judge changes between releases? Stop and recalibrate. A change in a judge invalidates every comparison made with it. Either pin the judge version or rerun every historical sample against the new judge before comparing.

How does this relate to Pareto frontiers? A composite is a scalar projection of a frontier. The frontier is the diagnostic; the composite is the gate. Use the frontier to choose the weights; use the composite to ship.
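To make the frontier concrete, the sketch below filters a set of candidates down to the non-dominated ones. Candidates A and B reuse the per-dimension scores from the example; candidate C and the function names are hypothetical, and every dimension is assumed to be normalized so that higher is better.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(a[d] >= b[d] for d in a) and any(a[d] > b[d] for d in a)


def pareto_frontier(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


candidates = {
    "A": {"faithfulness": 0.88, "helpfulness": 0.85, "refusal": 0.97,
          "latency": 0.72, "cost": 0.74},
    "B": {"faithfulness": 0.96, "helpfulness": 0.72, "refusal": 0.96,
          "latency": 0.55, "cost": 0.62},
    # Hypothetical candidate that is worse than A on every dimension.
    "C": {"faithfulness": 0.80, "helpfulness": 0.80, "refusal": 0.95,
          "latency": 0.70, "cost": 0.70},
}
frontier = pareto_frontier(candidates)
print(frontier)
```

A and B both stay on the frontier because each wins on at least one dimension, so choosing between them is a question about weights; C can be dropped without any weighting argument.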
