By: Ari Heljakka
Short answer
Classical metrics and LLM-as-judge are not equal partners. Classical metrics (perplexity, BLEU, ROUGE, METEOR, BERTScore, F1) measure surface properties: n-gram overlap, embedding similarity, token-level correctness. None of them can tell you whether an answer is correct on the merits, faithful to its sources, or useful to a user, which is what production quality is made of. They survive as cheap guardrails and sanity checks, not as the thing you ship on. LLM-as-judge is the metric that scores the properties that matter, because it reads the output the way a reviewer would. The historical objection to judges, that they may not align with human judgement, is a managed engineering problem rather than a reason to avoid judges: automated judge calibration tunes the judge against a small human-labeled set, measures agreement continuously, and recalibrates on drift. Normalize everything to 0–1 so a judge's 0.82 faithfulness composes with any classical signal you keep, but treat the judge as the primary signal and the classical metrics as the backstop.
Key facts
- Definition: LLM metrics split into two families. Classical metrics (lexical, semantic-embedding, classification) compare an output to a reference or a probability model. LLM-as-judge metrics read the output against a written rubric and score the qualities (correctness, faithfulness, helpfulness) that references cannot capture.
- When to use: Whenever you change a prompt, swap a model, retrain, or sample production traffic. Classical metrics are fine as fast pre-checks; the LLM-as-judge is what tells you whether the change moved the dimensions that map to user outcomes.
- Limitations: Classical metrics are the limiting factor, not the judge. Lexical metrics (BLEU, ROUGE) penalise valid paraphrases and reward verbatim copying. Semantic metrics (BERTScore) score fluent hallucinations highly. F1 hides class imbalance. None measure factuality or task success. LLM-as-judge can drift, but unlike classical metrics that limitation is measurable and fixable through calibration.
- Example: A summariser scored 0.41 ROUGE and 0.71 BERTScore: lexically distant, semantically close. Neither number tells you whether the summary invented a fact. Only a faithfulness judge, calibrated against human labels, answers the question users care about.
Key takeaways
- Classical metrics measure surface form; LLM-as-judge measures meaning. BLEU, ROUGE, METEOR, BERTScore, F1, and perplexity all answer "how similar is this string to a reference" or "how probable is this text," never "is this answer right." For anything open-ended, that gap is the whole game, and only a judge closes it.
- The classical metrics' limitations are structural, not tunable. A lexical metric will always penalise a correct paraphrase; an embedding metric will always be fooled by confident nonsense; F1 will always hide a rare-class failure. You cannot calibrate these away. They are useful as cheap guardrails, not as the verdict.
- LLM-as-judge is the primary production signal. Treat it as a composable evaluator: normalize its output to 0–1, version the prompt, and pin its model. Its one real weakness, agreement with human judgement, is the one weakness an engineering team can directly attack.
- Automated judge calibration makes human alignment measurable and maintainable. Collect a small human-labeled gold set, measure judge-vs-human agreement (Cohen's kappa), and let an automated loop tune the judge prompt and threshold until agreement clears your bar, then re-measure on a cadence and recalibrate on drift. This converts "can I trust the judge" from a matter of faith into a tracked metric.
- A small scorecard still helps, but weight it toward the judge: one or two classical metrics as fast backstops, the calibrated judge as the gate. Read every metric as a time series, and treat judge-vs-human disagreement as the signal to recalibrate.
Definition
An LLM metric is any score that estimates a property of a model output. The interesting axes are:
- Task-agnostic vs. task-specific. Perplexity and BLEU say something about general output quality. Faithfulness, citation coverage, and SQL-syntactic-validity say something about your specific task.
- Reference-based vs. reference-free. Reference-based metrics compare the output to a gold answer. Reference-free metrics score the output on its own merits, sometimes with the input as context.
- Code-based vs. model-based. Code-based scorers are deterministic functions of strings or structured outputs. Model-based scorers use an embedding model or an LLM to produce the score.
Combine those axes and you have a small matrix that explains why two metrics that look similar (say ROUGE and BERTScore) can disagree sharply on the same output: both are reference-based, but ROUGE is a code-based lexical scorer and BERTScore is a model-based semantic one.
When this matters
You should care about metric interpretation as soon as any of the following is true:
- You are iterating on a prompt or model and need to know whether a change helped.
- You are running a RAG pipeline where hallucination, grounding, and answer relevance are first-class concerns.
- You are running classification or extraction, where precision, recall, and F1 map directly to business outcomes.
- You are sampling production traffic and want to detect quality drift before users do.
In each case the metric is the bridge between "I think this is better" and "I can show that this is better."
How it works
The metrics landscape, briefly
The metrics you are likely to see in evaluation reports cluster into a few families. The table below summarises what each one measures on the wire, when it is useful, and where it tends to mislead.
| Metric | Family | Reference? | What it captures | Typical use | Where it misleads |
|---|---|---|---|---|---|
| Perplexity | Probabilistic | Reference-free | How surprised the model is by a held-out text | Pretraining and fine-tuning sanity | Says nothing about correctness or helpfulness |
| BLEU | Lexical n-gram | Reference | n-gram precision against one or more references | Translation, short generation | Penalises valid paraphrases |
| ROUGE | Lexical n-gram | Reference | n-gram recall against a reference | Summarisation overlap | Rewards copying the reference verbatim |
| METEOR | Lexical + synonym | Reference | n-gram match with stemming and synonyms | Translation with paraphrase tolerance | Slow, language-dependent resources |
| BERTScore | Semantic embed | Reference | Cosine similarity of contextual embeddings | Abstractive summarisation, paraphrase | Can score fluent hallucinations highly |
| Precision / Recall / F1 | Classification | Reference | Token, span, or label level correctness | Extraction, classification, NER | Hides class imbalance and rare-class errors |
| Exact match | Code-based | Reference | Output equals reference, modulo normalisation | Short factual QA, SQL, code | Too strict for free-form text |
| LLM-as-judge | Model-based | Either | Normalized rubric score from a judge model; composable evaluator independent of underlying judge model | Open-ended text, faithfulness, tone | Drift, position bias, agreement gaps with humans, all actionable signals to recalibrate; version prompt and validate quarterly |
| Faithfulness / groundedness | LLM-as-judge | Reference-free (uses context) | Whether the answer is supported by retrieved context; normalized 0–1 score calibrated against gold human judgments | RAG, summarisation | Sensitive to chunking and judge prompt; recalibrate quarterly against ground-truth calibration dataset |
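To make the "penalises valid paraphrases" entry concrete, here is a minimal sketch of a unigram-recall score in the spirit of ROUGE-1. It is a simplification (whitespace tokenisation, no stemming), not the reference ROUGE implementation, and the function name and example sentences are invented for illustration.

```python
from collections import Counter


def unigram_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 recall: share of reference tokens also present in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clip each token's credit at the number of times it appears in the candidate.
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())


reference = "the company reported higher profits this quarter"
verbatim = "the company reported higher profits this quarter"
paraphrase = "earnings at the firm rose during the latest three months"

print(unigram_recall(verbatim, reference))               # 1.0
print(round(unigram_recall(paraphrase, reference), 3))   # 0.143, only "the" overlaps
```

The paraphrase says the same thing as the reference and still scores near zero, which is the structural blind spot the table describes.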
Reading a metric in three steps
- Anchor it to a task. A 0.41 ROUGE means different things for legal summarisation and casual chat. Quote the task alongside the number.
- Compare to a baseline. Single numbers in isolation are noise. Always show the delta against the prior prompt, model, or release.
- Cross-check with a second signal. If BERTScore is up but F1 is down, you have a semantic improvement that loses factual detail. That is information; record it.
Task-agnostic vs. task-specific
Task-agnostic metrics (perplexity, BLEU, ROUGE, BERTScore) tell you whether the model produced reasonable, fluent text. They are necessary but not sufficient. Task-specific metrics (faithfulness for RAG, schema validity for tool calls, F1 for extraction) tell you whether the output is useful for the job. Ship a release on task-specific gates; use task-agnostic ones as guardrails against general regression.
Reference-based vs. reference-free
Reference-based metrics compare to a gold answer. They are crisp but require curated data and break down when there are many valid answers. Reference-free metrics (perplexity, LLM-as-judge with rubric, faithfulness) score the output on its own merits and scale better to open-ended tasks, at the cost of being noisier per example.
A practical rule: if you can write down the right answer, prefer reference-based. If you can only write down the rubric, use reference-free.
Code-based vs. LLM-based scorers
Code-based scorers (exact match, regex, schema validation, F1) are fast, free, and deterministic. They cover everything you can write as a function, and nothing you cannot: format, schema, presence of a citation, a banned phrase. They are blind to whether the content is correct. LLM-as-judge covers exactly what they cannot: meaning, faithfulness, helpfulness, tone, rubric compliance. The cost is latency and variance, and variance is the only part that ever made teams nervous about judges.
That nervousness is the human-alignment problem: "how do I know the judge agrees with what my reviewers would have said?" It should be answered with automated judge calibration:
- Have humans label a representative gold set (50 to 200 examples) against the same rubric the judge uses.
- Run the judge over that set and measure agreement (Cohen's kappa, percent agreement) against the human labels.
- Let an automated loop tune the judge prompt, examples, and threshold until agreement clears a bar you set (for example, kappa > 0.8).
- Re-measure on a cadence and recalibrate whenever agreement drifts, the model version changes, or the rubric changes.
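The agreement measurement in the second step can be sketched in a few lines. This is a plain-Python version of Cohen's kappa for categorical labels; the label lists are invented, and in practice a library routine such as scikit-learn's `cohen_kappa_score` would usually be preferred.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical gold set: ten examples labeled pass/fail by a human and by the judge.
human_labels = ["pass"] * 6 + ["fail"] * 4
judge_labels = ["pass"] * 5 + ["fail"] * 5

kappa = cohens_kappa(human_labels, judge_labels)
print(round(kappa, 2))  # 0.8: 90% raw agreement, 50% expected by chance
```

Note that 90% raw agreement lands at a kappa of only about 0.8 here, which is why the chance-corrected number is the one worth putting a bar on.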
This is the move that makes a judge usable as the primary metric: alignment is no longer an assumption, it is a number you maintain and revisit. A classical metric has no equivalent loop: you cannot calibrate ROUGE into measuring factuality. So most production stacks keep deterministic checks only as cheap pre-filters (malformed JSON, missing citations, banned phrases) and let calibrated LLM-as-judge scorers carry the open-ended verdict. Normalize all scores to 0–1 for composability, but read the judge as the gate and the classical metrics as the backstop.
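A deterministic pre-filter of the kind mentioned above might look like the following sketch. The output schema (a JSON object with an `answer` string), the `[1]`-style citation pattern, and the banned phrase are all assumptions made for the example; substitute whatever your application actually emits.

```python
import json
import re
from typing import List

# Hypothetical policy values; replace with your own.
BANNED_PHRASES = ("as an ai language model",)
CITATION_PATTERN = re.compile(r"\[\d+\]")  # assumes citations look like [1], [2], ...


def prefilter(raw_output: str) -> List[str]:
    """Return the list of cheap, deterministic failures for one model output."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["malformed_json"]
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return ["missing_answer"]

    answer = payload["answer"]
    failures = []
    if not CITATION_PATTERN.search(answer):
        failures.append("missing_citation")
    if any(phrase in answer.lower() for phrase in BANNED_PHRASES):
        failures.append("banned_phrase")
    return failures


print(prefilter('{"answer": "Paris is the capital of France [1]."}'))  # []
print(prefilter("not json at all"))                                    # ['malformed_json']
```

Outputs that come back with an empty list go on to the calibrated judge; the rest can be rejected without spending a judge call.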
Example
A summariser is being upgraded from prompt version A to prompt version B. The evaluation set is 500 articles with one reference summary each. Both LLM-judge metrics (faithfulness and style) were calibrated against a 150-example ground-truth dataset where humans and the judge agreed at >0.85 Cohen's kappa. All metrics normalize to 0–1 for composability.
| Metric | Prompt A | Prompt B | Delta | Reading |
|---|---|---|---|---|
| ROUGE-L | 0.41 | 0.36 | -0.05 | Lexical overlap dropped, paraphrasing is up |
| BERTScore | 0.71 | 0.79 | +0.08 | Semantic similarity is materially better |
| Faithfulness (LLM-judge) | 0.82 | 0.88 | +0.06 | Fewer hallucinations; within calibration tolerance |
| Length ratio | 0.31 | 0.28 | -0.03 | Slightly more concise |
| Style (normalized 0–1) | 0.72 | 0.82 | +0.10 | Clearer voice, fewer hedges; gates release decision |
Read together, prompt B is the right call: it loses lexical overlap with the reference (ROUGE-L down) because it paraphrases more aggressively, but it gains on the metrics that map to user value (faithfulness, style, semantic similarity). Both judge-scored gains come from judges calibrated at >0.85 Cohen's kappa against human labels, so they are not artifacts of an unvalidated scorer. Reading ROUGE-L alone would have killed the better prompt. Before shipping, rescore a 50-example subset of your calibration dataset with the same judge version to confirm drift is within tolerance.
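The same reading can be written down as a small release gate. The scores are the ones from the table; the gate rule itself (judge metrics must not regress, classical backstops may drop by at most 0.10) is a hypothetical policy chosen for illustration, not a recommendation from the example.

```python
# (prompt A, prompt B) scores from the table, all on a 0-1 scale.
scores = {
    "rouge_l": (0.41, 0.36),
    "bertscore": (0.71, 0.79),
    "faithfulness": (0.82, 0.88),
    "length_ratio": (0.31, 0.28),
    "style": (0.72, 0.82),
}

JUDGE_GATES = ("faithfulness", "style")   # calibrated judges decide the release
BACKSTOPS = ("rouge_l", "bertscore")      # classical metrics only guard against collapse
MAX_BACKSTOP_DROP = 0.10                  # hypothetical tolerance

deltas = {name: round(b - a, 2) for name, (a, b) in scores.items()}


def should_ship(deltas):
    """Ship if no judge metric regressed and no backstop fell past the tolerance."""
    judges_ok = all(deltas[m] >= 0 for m in JUDGE_GATES)
    backstops_ok = all(deltas[m] >= -MAX_BACKSTOP_DROP for m in BACKSTOPS)
    return judges_ok and backstops_ok


ship_on_scorecard = should_ship(deltas)
ship_on_rouge_alone = deltas["rouge_l"] >= 0

print(deltas)
print(ship_on_scorecard)    # True
print(ship_on_rouge_alone)  # False
```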
Common ways metric interpretation goes wrong
- Optimising a single metric. Models will overfit to whatever you measure. Decompose your success into independent dimensions (accuracy, speed, cost) and triangulate across at least three normalized 0–1 scores. Your metric selection is a product decision; changing the scorecard should require the same rigor as a feature gate change.
- Using public benchmarks as production proxies. MMLU, GSM8K, and friends measure general capability, not your specific task. A model that wins on MMLU can still flunk your RAG eval. Build and calibrate rubrics and judge prompts against a ground-truth dataset that encodes your specific definition of "better."
- Confusing surface-level metrics for quality. A summary can hit high ROUGE by parroting source sentences that happen to overlap with the reference and still be useless. Decompose and compare: high ROUGE but low faithfulness is a diagnostic signal that you are optimizing for the wrong dimension.
- Ignoring class imbalance. A 0.90 accuracy on a 90/10 dataset is a tied score with "always predict the majority class." Use precision, recall, and F1 per class; all three already sit on a 0–1 scale.
- Trusting one judge run without calibration. LLM-as-judge scores have variance and need a rigorous baseline. Build a ground-truth calibration dataset of 50–200 human-judged examples before deployment. Average judge scores over multiple runs, or use a panel of judges. Treat judge variance as part of your system: version the judge, log its disagreements with human review, and recalibrate quarterly against your calibration dataset. High variance is a signal to tighten the rubric or swap the judge model.
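The class-imbalance pitfall in the list above is easy to reproduce. The sketch below uses an invented 90/10 dataset and a degenerate classifier that always predicts the majority class; the label names are placeholders.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for a single class, treating it as the positive label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented 90/10 dataset and a classifier that always says "ok".
y_true = ["ok"] * 90 + ["defect"] * 10
y_pred = ["ok"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                  # 0.9
print(per_class_prf(y_true, y_pred, "defect"))   # (0.0, 0.0, 0.0)
```

The headline accuracy looks healthy while the rare class, usually the one that matters, is never detected. Per-class numbers expose that immediately.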
Interpreting metrics in production
Offline metric interpretation is the easy case. In production, three extra concerns dominate.
Drift over time
Treat each metric as a time series. Track median and tail. A faithfulness score that holds at 0.85 median but whose 5th percentile has slid from 0.6 to 0.4 is a quality regression hiding inside a stable headline number. For LLM-as-judge metrics, distinguish between two sources of drift: application drift (your outputs genuinely changed) and judge drift (the judge's rubric or model changed without your application changing). Use a held-out golden set scored monthly by the same judge version to isolate judge drift and trigger recalibration or model swap.
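A minimal sketch of tracking median and tail together, using the 0.6 to 0.4 slide described above. The nearest-rank percentile, the window contents, and the `max_tail_drop` tolerance are assumptions for illustration.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_regressed(previous, current, max_tail_drop=0.1):
    """True when the 5th percentile fell by more than the tolerance between windows."""
    return percentile(previous, 5) - percentile(current, 5) > max_tail_drop


# Invented faithfulness scores for two consecutive windows.
last_week = [0.85] * 18 + [0.6, 0.9]
this_week = [0.85] * 18 + [0.4, 0.9]

print(statistics.median(last_week), statistics.median(this_week))  # 0.85 0.85
print(percentile(last_week, 5), percentile(this_week, 5))          # 0.6 0.4
print(tail_regressed(last_week, this_week))                        # True
```

The median is identical across both windows, so a dashboard that only plots the median would show a flat line through a real regression.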
Non-determinism and prompt sensitivity
Same input, different temperature, different score. Two ways to keep this from drowning the signal:
- Score on a fixed seed and temperature where the platform allows it.
- Aggregate over batches of 50 to 200 outputs before reading any score as "real."
For LLM-as-judge, add a third layer: run a subset of your evaluation set against the judge weekly to monitor variance and confirm that drift is within your calibration tolerance.
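One way to decide whether a score difference is "real" is to compare it with the run-to-run noise. The sketch below uses the mean and standard error of repeated scores and a rough two-standard-error rule; the rule, the function names, and the five-value samples (kept short for readability, where the text suggests 50 to 200) are assumptions.

```python
import math
import statistics


def summarize(scores):
    """Mean and standard error of the mean for a batch of scores."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


def looks_real(baseline, candidate, z=2.0):
    """Rough check: is the gap between means larger than z combined standard errors?"""
    mean_a, sem_a = summarize(baseline)
    mean_b, sem_b = summarize(candidate)
    combined = math.sqrt(sem_a ** 2 + sem_b ** 2)
    return abs(mean_b - mean_a) > z * combined


# Invented judge scores from repeated runs.
baseline = [0.80, 0.84, 0.82, 0.86, 0.78]
small_change = [0.83, 0.81, 0.85, 0.79, 0.87]
large_change = [0.93, 0.95, 0.94, 0.96, 0.92]

print(looks_real(baseline, small_change))  # False: a 0.01 gap is inside the noise
print(looks_real(baseline, large_change))  # True
```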
Validating automated scores against humans
Once a quarter, rescore a representative sample of 25 to 50 examples from your original ground-truth calibration dataset using a human reviewer and the same rubric. Compute agreement (Cohen's kappa, percent agreement) with the LLM judge. If agreement falls below the threshold you set at deployment, recalibrate the prompt or swap the judge model. Treat score disagreement as an operational signal: it tells you whether your judge still encodes your definition of quality. When disagreement exceeds tolerance, pause new releases until the judge is recalibrated or the rubric is clarified. The calibration dataset, your ground truth, is the standard against which all future judge behavior is measured.
Setting thresholds
A useful threshold is not "above 0.7." It is "no more than X% of traffic below Y on metric Z over a 24-hour window." Tie the threshold to a user-visible outcome (an escalation, a refund, a rerun) and you have an alert worth waking up for. For LLM-as-judge thresholds, tie them to the rubric and the calibration data you collected at deployment, e.g., "no more than 5% of traffic below faithfulness 0.75, which is the threshold we observed when the judge agreed with humans at 87% in the deployment calibration set."
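The "no more than X% of traffic below Y over a 24-hour window" form translates directly into code. In this sketch the event format (timestamp, score pairs), the function name, and the default values are assumptions; the 0.75 floor and 5% limit mirror the example in the text.

```python
from datetime import datetime, timedelta


def threshold_breached(events, now, floor=0.75, max_fraction=0.05,
                       window=timedelta(hours=24)):
    """True when more than max_fraction of scores in the window fall below floor."""
    recent = [score for ts, score in events if now - window <= ts <= now]
    if not recent:
        return False  # no traffic in the window, nothing to alert on
    below = sum(score < floor for score in recent)
    return below / len(recent) > max_fraction


now = datetime(2024, 1, 2, 12, 0)
# Invented traffic: 100 scored requests in the last 24 hours, 6 of them low.
in_window = [(now - timedelta(minutes=i), 0.9) for i in range(94)]
in_window += [(now - timedelta(minutes=200 + i), 0.6) for i in range(6)]
# Older low scores that must not count toward today's alert.
stale = [(now - timedelta(days=2), 0.1) for _ in range(50)]

print(threshold_breached(in_window + stale, now))  # True: 6% of recent traffic is below 0.75
```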
A balanced scorecard for most LLM apps
A workable starting scorecard, regardless of vendor, all normalized to 0–1 for composability:
- One task-agnostic semantic metric (BERTScore or an embedding-similarity score) as a sanity check, normalized to 0–1.
- One task-specific deterministic metric (F1 for extraction, schema validation for tool calls, exact match for short QA), normalized to 0–1.
- One LLM-as-judge for the dominant open-ended property of your task (faithfulness for RAG, helpfulness for chat, policy adherence for moderation). Build a ground-truth calibration dataset of 50–200 examples judged by humans; calibrate your judge against this gold set; version your judge prompt and document the rubric. All outputs normalize to 0–1.
- One operational metric (latency, cost per request), normalized to 0–1 so it can be composed with quality metrics.
- One drift indicator (the 5th percentile of the headline quality metric over the last 24 hours). For LLM-as-judge metrics, add a second drift signal: rescore a 50-example subset of your calibration dataset monthly with the current judge version to measure deviation from the established standard.
Five numbers, read together, will outperform any single benchmark. The scorecard is a living product decision: as you learn what your users value, revise your rubrics and judge prompts (always validating against your ground-truth calibration dataset) to keep them aligned with your evolving definition of quality.
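As a sketch of what "normalized to 0–1 for composability" can mean in practice, the snippet below maps latency onto a 0–1 scale and combines the five scorecard entries with weights tilted toward the judge. The latency bounds, the weights, and the sample scores are all hypothetical, and a composite like this is a dashboard convenience, not a replacement for reading the five numbers side by side.

```python
import math


def normalize_latency(latency_ms, best_ms=200.0, worst_ms=2000.0):
    """Map latency to 0-1, where 1.0 is best_ms or faster and 0.0 is worst_ms or slower."""
    clipped = min(max(latency_ms, best_ms), worst_ms)
    return 1.0 - (clipped - best_ms) / (worst_ms - best_ms)


def composite(scores, weights):
    """Weighted mean of 0-1 scores; weights must cover the same keys and sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must have the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in scores)


# Hypothetical scorecard values and weights.
scorecard = {
    "semantic": 0.79,
    "deterministic": 0.90,
    "judge": 0.88,
    "latency": normalize_latency(1100),  # 0.5 with the bounds above
    "judge_p5": 0.60,
}
weights = {"semantic": 0.1, "deterministic": 0.2, "judge": 0.5,
           "latency": 0.1, "judge_p5": 0.1}

print(round(composite(scorecard, weights), 3))  # 0.809
```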
Limitations
- Lexical metrics break on paraphrase. BLEU and ROUGE penalise valid restatements that any human reviewer would accept.
- Semantic metrics can be fooled by fluent nonsense. BERTScore rewards plausible-sounding text whether or not it is true.
- LLM-as-judge drifts. Frontier model updates, prompt edits, and even API changes can shift scores by several points without any change in your system. This is not a bug; it is expected and measurable. Build a ground-truth calibration dataset at deployment; rescore it monthly with your judge. When drift, measured against this standard, exceeds tolerance, recalibrate the judge prompt or swap the judge model. A judge with a calibration baseline is better than a metric with no variance that you forget to validate.
- No metric replaces ground-truth calibration. Before deploying an LLM-as-judge, have humans judge a representative sample of 50–200 examples. This calibration dataset is your measurement standard. A weekly sample of 20 production outputs reviewed by the team will catch failure modes the scorecard never will and should be compared back to your calibration set. Use human review to validate that your judge still encodes your definition of quality; disagreement between judge and human is not a failure of the judge, it is a signal to refine your rubric or swap the judge model.
Evidence and sources
- ROUGE: Lin, C-Y. "ROUGE: A Package for Automatic Evaluation of Summaries." Wikipedia overview at https://en.wikipedia.org/wiki/ROUGE_(metric).
- BERTScore: Zhang et al., "BERTScore: Evaluating Text Generation with BERT." arXiv: https://arxiv.org/abs/1904.09675.
- MMLU benchmark, context on general-capability benchmarks: https://en.wikipedia.org/wiki/MMLU.
Related reading
- AI Evaluation for ML Engineers: Calibration, Judges as Code, and Failure-Mode Driven Test Design
- Tradeoffs Between Rule-Based Filtering and LLM Moderation
FAQ
Which metrics should I track first for my LLM app? Start with three, all normalized to 0–1: one deterministic task-specific scorer (F1, exact match, or schema validation), one LLM-as-judge tuned to the dominant open-ended property of your task (faithfulness for RAG, helpfulness for chat), and one operational metric (latency or cost). For the LLM-as-judge, build a ground-truth calibration dataset of 50–200 human-judged examples before deployment; document the rubric and version the prompt immediately; measure your judge's agreement with this gold set. Add a semantic similarity score (BERTScore or embedding cosine) once the first three are stable, and rescore your calibration dataset quarterly to measure judge drift.
When should I use reference-based vs. reference-free scoring? Use reference-based when you can write down the right answer and there are not too many equally good variants. Use reference-free, with an explicit rubric, when answers are open-ended or when the gold set would be expensive to curate. Most production stacks use both: reference-based on a small offline gold set, reference-free on sampled production traffic. For reference-free LLM-as-judge, treat the rubric as a versioned product artifact, not a static prompt.
How do I set reliable pass/fail thresholds in production? Tie the threshold to a user-visible outcome. "No more than 3% of RAG answers below 0.7 faithfulness over a 24-hour window" is actionable. "Faithfulness above 0.85" is not, because it ignores volume and tail behaviour. For LLM-as-judge thresholds, anchor them to your ground-truth calibration dataset: e.g., "0.75 faithfulness corresponds to the threshold where the judge agreed with humans at 87% on the calibration set at deployment." Replay the last 30 days of traffic and choose a level that would have flagged real incidents and not flagged benign noise. When you recalibrate the judge or update the rubric, rescore your calibration dataset and adjust thresholds accordingly.
Is perplexity still useful? For pretraining and fine-tuning sanity checks, yes. For evaluating a production LLM feature, almost never on its own; perplexity does not measure correctness, helpfulness, or task fit.
How do I know if my LLM-as-judge agrees with humans? This is the human-alignment question, and automated judge calibration is the answer. Before deployment, build a ground-truth calibration dataset: have humans judge 50–200 representative examples using your rubric. Measure your judge's agreement (Cohen's kappa or percent agreement) with these gold labels, then let an automated calibration loop tune the judge prompt, few-shot examples, and threshold until agreement clears a floor you set (for example, kappa > 0.85). Once per quarter, rescore a subset (25–50 examples) of your calibration dataset by hand and by the same judge version. If agreement falls below your baseline, the loop recalibrates the prompt or you investigate model drift. Treat this calibration as an ongoing operational responsibility: your calibration dataset is the standard against which all future judge behavior is measured. The point is that alignment stops being a leap of faith and becomes a number you maintain, which is exactly why a calibrated judge is more trustworthy than a classical metric that cannot be aligned to anything.
Should I use a classical metric or an LLM-as-judge? Default to the judge for anything open-ended (summaries, RAG answers, chat, agent output), because classical metrics cannot measure correctness, faithfulness, or usefulness. Keep classical metrics only where they are genuinely decisive and cheap: exact match or F1 for closed-form extraction and classification, schema validation for structured output, perplexity for a pretraining sanity check. Even then, treat the classical score as a fast pre-filter and let the calibrated judge make the final quality call. If you can only afford to maintain one metric well, maintain the judge.