Updated: 2026-03-01 By: Ari Heljakka
Short answer
Human feedback only becomes a useful metric when it is collected against a clear rubric, scored on a scale chosen for the signal, normalised to 0 to 1 so it can compose with other metrics, and tracked for inter-rater reliability. Without that discipline, raw thumbs-up counts and free-text comments produce a noisy aggregate that cannot drive deployment decisions. With it, human feedback becomes a calibration source for automated evaluators, a ground-truth dataset for regression gates, and a continuous signal that anchors every other measurement in the loop.
Key facts
- Definition: Human feedback metrics are structured measurements derived from human ratings of system outputs, scored against an explicit rubric and normalised to a comparable scale.
- When to use: Whenever automated scores need calibration, a new quality dimension is introduced, or incident or appeal data is the source of truth for a regression case.
- Limitations: Human ratings carry variance and bias; without inter-rater reliability tracking, individual differences masquerade as system drift; scale fatigue degrades signal quality on long surveys.
- Example: A summarisation feature scored on three orthogonal dimensions (faithfulness, completeness, clarity) by three independent raters per item, with Fleiss' Kappa reported per dimension and a 0 to 1 composite gating production.
Key takeaways
- The rubric is the artifact. Without a written rubric, "human feedback" is opinion, not measurement.
- Decompose quality into independent dimensions before designing the rating scale; collapsing everything into one number throws away the signal that drives fixes.
- Pick the scale for the dimension: binary for unambiguous categories, three-to-five points for routine grading, longer scales only when raters can use the extra granularity.
- Inter-rater reliability (Cohen's Kappa, Fleiss' Kappa, Krippendorff's Alpha) is the meta-metric. Drops in reliability often mean a rubric ambiguity, not a system change.
- Human feedback's highest-value use is as calibration ground truth for automated evaluators. Continuous comparison between human and judge scores is what keeps the loop honest.
Definition
Human feedback metrics are quantitative measurements produced from human judgements of system outputs. A working metric requires four things: a rubric that defines what is being judged, a scale on which the judgement is recorded, a normalisation that turns scale values into a comparable 0 to 1 score, and a reliability metric that tracks how consistently independent raters apply the rubric.
This separates what to measure (the rubric) from who measures (the rater pool) from how to express the measurement (the scale and normalisation). The separation is what makes human feedback composable with automated evaluators, comparable across time, and resistant to rater drift.
When this matters
Structured human feedback is critical when at least one of these holds:
- Calibration of automated evaluators. Any LLM-as-judge needs a labeled set to calibrate against; human feedback produces it.
- Introduction of a new dimension. A new quality dimension (faithfulness for RAG, instruction depth for assistants, brand voice for content) does not have a public benchmark; human labels are the only source of truth at the start.
- Incident review. When a production failure surfaces, the labeled trajectory that becomes the regression case usually carries human-reviewed dimension scores.
- Appeals and disputes. User-driven appeals are a free source of high-quality labels and a place where rater inconsistency damages trust.
- Long-running surfaces. When a system runs for years, the live distribution drifts; human feedback is the anchor that keeps the evaluator panel calibrated against current reality.
Unstructured human feedback (thumbs-up counts, free-text comments) is still useful as a signal, but it cannot drive deployment decisions on its own.
How it works
A working human-feedback metric pipeline has six stages.
Stage 1, define quality dimensions
Decompose what "good" means into independent dimensions. Each dimension should be orthogonal to the others (no overlap, no double-counting). Typical sets:
- For summarisation: faithfulness, completeness, clarity, conciseness.
- For RAG: factuality, citation correctness, helpfulness, off-topic rate.
- For assistants: instruction following, helpfulness, tone, safety.
- For content generation: brand voice, factual accuracy, format compliance.
The dimensions are the objective the system must satisfy; they are independent of any specific prompt, model, or implementation. Write them down before the rubric.
Stage 2, write the rubric
For each dimension, write a short, concrete rubric: what counts as the top anchor (a 5 on a five-point scale, or a pass), what counts as the bottom anchor (a 1 on that scale, or a fail), and an example for each anchor. Keep rubrics short. Long rubrics produce checklist fatigue: raters skim and agreement drops. A short set of clear dimensions applied consistently tends to produce cleaner signal than a long checklist applied inconsistently.
The rubric is a versioned artifact. Treat it the same way you treat code: review it, lock it, change it on a documented cadence, and tag every dataset version with the rubric version it was scored against.
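One way to treat the rubric like code is to represent it as data and stamp every rating with the rubric version it was scored against. The sketch below is a minimal illustration under that assumption; the class names, fields, and the example anchor text are hypothetical, not a prescribed schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """One anchor point on the scale, with a description and an example."""
    value: int
    description: str
    example: str


@dataclass(frozen=True)
class Rubric:
    """A rubric for a single dimension, pinned to an explicit version."""
    dimension: str
    version: str
    anchors: tuple[Anchor, ...]

    def scale_bounds(self) -> tuple[int, int]:
        """Lowest and highest anchor values defined by this rubric."""
        values = [anchor.value for anchor in self.anchors]
        return min(values), max(values)


@dataclass(frozen=True)
class Rating:
    """A single human judgement, tagged with the rubric version it used."""
    item_id: str
    rater_id: str
    dimension: str
    rubric_version: str
    value: int


def rate(rubric: Rubric, item_id: str, rater_id: str, value: int) -> Rating:
    """Record a rating, rejecting values outside the rubric's scale."""
    low, high = rubric.scale_bounds()
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the {low}-{high} scale")
    return Rating(item_id, rater_id, rubric.dimension, rubric.version, value)


# Hypothetical clarity rubric with a bottom anchor and a top anchor.
clarity_v2 = Rubric(
    dimension="clarity",
    version="2.0.0",
    anchors=(
        Anchor(1, "Reader cannot follow the text.", "Run-on text, undefined acronyms."),
        Anchor(5, "Reader follows the text on first read.", "Short sentences, terms defined."),
    ),
)
```

Because the version travels with each `Rating`, scores collected under different rubric versions can be separated later instead of being averaged together by accident.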
Stage 3, pick the rating scale
Match the scale to the dimension:
- Binary (pass/fail). Use for unambiguous categories: policy violation, factual error, schema violation.
- Three-point (better, same, worse). Use for pairwise comparisons between system versions.
- Five-point Likert. Use for routine grading of subjective dimensions (clarity, helpfulness, tone). The sweet spot for rater throughput.
- Seven- or ten-point. Use only when raters demonstrably use the extra granularity. Otherwise the extra points add noise, not signal.
- Pairwise preference plus magnitude. Useful for A/B comparison of model versions; magnitude carries effect-size signal.
Whatever the scale, normalise to 0 to 1 before composing with other metrics. With min-max scaling, a five-point Likert score of 4 becomes (4 - 1) / (5 - 1) = 0.75. Pairwise wins become a win rate, not raw counts. This normalisation is what makes human feedback compose with automated evaluator scores in a scorecard.
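The normalisation step can be sketched in a few lines. This is a minimal illustration that assumes min-max scaling for rating scales and, for pairwise comparisons, counts a tie as half a win; the tie convention and the function names are assumptions, not something the workflow above prescribes.

```python
def normalise_scale(value: float, low: float, high: float) -> float:
    """Map a rating on [low, high] to [0, 1] by min-max scaling."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the {low}-{high} scale")
    return (value - low) / (high - low)


def win_rate(wins: int, losses: int, ties: int = 0) -> float:
    """Share of pairwise comparisons won.

    Assumption: a tie counts as half a win. Other conventions
    (for example, dropping ties) are equally defensible.
    """
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total


# Five-point Likert score of 4 on a 1-to-5 scale.
likert = normalise_scale(4, low=1, high=5)   # 0.75

# Binary pass/fail recorded as 1/0 is already on the target scale.
binary = normalise_scale(1, low=0, high=1)   # 1.0

# Six wins, three losses, one tie across ten comparisons.
pairwise = win_rate(wins=6, losses=3, ties=1)  # 0.65
```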
Stage 4, design the workflow
A working workflow includes:
- Per-item routing. Each item is routed to N independent raters who score it without seeing each other's ratings, and all N ratings are collected before composition.
- Stratified sampling. The rated set is stratified across slices (common, edge, adversarial; per-surface; per-customer-segment) so aggregate scores carry interpretable signal.
- Hybrid automation. Routine items get automated evaluator scores; only items flagged as ambiguous, high-stakes, or above a complexity threshold get human review. The hybrid keeps human throughput affordable without losing the calibration anchor.
- Feedback loop into the dataset. Every human-rated item joins the ground-truth dataset with its rubric version and rater identity recorded. The dataset is the persistent artifact; individual rating sessions are not.
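The routing and hybrid-automation steps above can be sketched as two small functions. This is an illustration only: the `Item` fields, the threshold and sample-rate defaults, and the hash-based sampling are all assumptions chosen to keep the example deterministic and reproducible.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    judge_confidence: float  # 0 to 1, reported by the automated evaluator
    high_stakes: bool = False


def _stable_fraction(key: str) -> float:
    """Deterministic value in [0, 1) derived from a string key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def needs_human_review(
    item: Item,
    confidence_threshold: float = 0.8,  # hypothetical default
    routine_sample_rate: float = 0.1,   # hypothetical default
) -> bool:
    """Decide whether an item goes to human raters.

    High-stakes and low-confidence items are always reviewed;
    routine items are sampled at a fixed rate.
    """
    if item.high_stakes:
        return True
    if item.judge_confidence < confidence_threshold:
        return True
    return _stable_fraction(item.item_id) < routine_sample_rate


def assign_raters(item_id: str, rater_pool: list[str], n_raters: int = 3) -> list[str]:
    """Pick n distinct raters for an item, deterministically per item."""
    if n_raters > len(rater_pool):
        raise ValueError("not enough raters in the pool")
    ranked = sorted(rater_pool, key=lambda rater: _stable_fraction(f"{item_id}:{rater}"))
    return ranked[:n_raters]
```

Hashing the item id rather than calling a random generator means the same item always gets the same routing decision, which makes the sampled set auditable after the fact.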
Stage 5, track inter-rater reliability
Without a reliability metric, individual rater differences look identical to system drift. Track:
- Cohen's Kappa for two raters per item.
- Fleiss' Kappa for three or more raters per item.
- Krippendorff's Alpha when rater counts vary per item, some ratings are missing, or the scale is ordinal or interval rather than nominal.
Set a threshold per dimension (commonly 0.6 to 0.8). Drops below the threshold are a rubric signal first (ambiguity in the rubric is producing inconsistent judgements) and a rater signal second (a rater has drifted or a new rater is mis-calibrated). Rewrite the rubric or retrain the raters; do not silently average across the noise.
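For reference, both Kappa statistics compare observed agreement with the agreement expected by chance. The sketch below is a plain-Python illustration of the standard unweighted formulas; the function names and the handling of the degenerate single-category case are choices made for this example.

```python
from collections import Counter


def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Unweighted Cohen's Kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of each rater's marginal rate per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here (0/0); this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings_per_item: list[list]) -> float:
    """Fleiss' Kappa for a fixed number of raters (two or more) per item."""
    if not ratings_per_item:
        raise ValueError("no items")
    n_raters = len(ratings_per_item[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings_per_item):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Share of rater pairs on this item that agree.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    observed = agreement_sum / len(ratings_per_item)
    total = len(ratings_per_item) * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Undefined (0/0); this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Two raters, four items, one disagreement.
kappa_two = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])  # 0.5

# Three raters, two items, unanimous on both.
kappa_three = fleiss_kappa([[1, 1, 1], [0, 0, 0]])  # 1.0
```

Both functions treat the scale as nominal, so a 4-versus-5 disagreement on a Likert scale counts the same as a 1-versus-5 disagreement. For ordinal scales, a weighted Kappa or Krippendorff's Alpha is usually the better fit.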
Stage 6, compose and gate
Per-dimension 0 to 1 scores compose into a scorecard with documented weights. Per-dimension floors gate deployment. The composite score is the headline; the per-dimension floors are the regression gates that actually prevent bad releases. Composing without floors is what produces "average is fine, ship it" deployments that ship known-bad regressions on a single dimension.
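A minimal sketch of composition with floors follows. The dimension names, scores, weights, and floors are hypothetical values chosen to show a case where the composite looks healthy while one dimension fails its floor.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, all on a 0 to 1 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


def floor_failures(scores: dict[str, float], floors: dict[str, float]) -> list[str]:
    """Dimensions whose score is below the configured floor."""
    return sorted(d for d, floor in floors.items() if scores[d] < floor)


def gate(scores: dict[str, float], weights: dict[str, float], floors: dict[str, float]) -> dict:
    """Report the composite, but let only the floors decide the deploy."""
    failures = floor_failures(scores, floors)
    return {
        "composite": composite_score(scores, weights),
        "failed_floors": failures,
        "deploy": not failures,
    }


# Hypothetical release candidate: strong clarity, weak faithfulness.
result = gate(
    scores={"faithfulness": 0.70, "clarity": 0.95},
    weights={"faithfulness": 0.5, "clarity": 0.5},
    floors={"faithfulness": 0.80, "clarity": 0.70},
)
# The composite is about 0.825, yet the deploy is blocked on faithfulness.
```

Keeping the deploy decision independent of the composite is the design choice that matters here: the headline number is reported, but it never overrides a failed floor.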
Example
A multi-turn assistant feature. The human-feedback pipeline:
- Dimensions. Four orthogonal axes: instruction following, helpfulness, factuality, tone. Each scored 0 to 1.
- Rubric. One paragraph per dimension with three anchor examples (fail, marginal, pass). Versioned in the same repo as the application code; reviewed quarterly.
- Scale. Five-point Likert for instruction following, helpfulness, and tone; binary for factuality (factual error or not). All normalised to 0 to 1 before composition.
- Workflow. Three raters per item. Routine items (above a confidence threshold from the automated panel) sampled at 10 percent. Low-confidence items reviewed at 100 percent. Adversarial slices reviewed by senior raters only.
- Reliability. Fleiss' Kappa computed per dimension per week. Threshold 0.7 per dimension. A drop on "tone" last quarter was traced to a rubric ambiguity around informal phrasing; the rubric was tightened and Kappa recovered.
- Composition. Per-dimension scores composed with documented weights (factuality 0.4, instruction 0.25, helpfulness 0.2, tone 0.15) into an aggregate. Per-dimension floors set as 0.95 for factuality, 0.85 for instruction following, 0.75 for the others. Any floor failure blocks the deploy; the aggregate alone does not gate.
The same rubric anchors the LLM-as-judge calibration. The judge's per-dimension scores are compared against human scores weekly; when judge agreement on any dimension drops below 0.85, the judge is recalibrated against the human-labeled slice before it is trusted to gate alone.
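The weekly judge-versus-human comparison can be sketched as below. The text does not specify how agreement is measured, so this example assumes one plausible definition: the share of items where the judge's normalised score falls within a tolerance of the mean human score. The tolerance of 0.125 (half a step on a normalised five-point scale), the data layout, and the function names are all assumptions.

```python
from statistics import mean


def agreement_rate(
    human_scores: dict[str, list[float]],
    judge_scores: dict[str, float],
    tolerance: float = 0.125,
) -> float:
    """Share of shared items where the judge is within tolerance of the human mean."""
    shared = human_scores.keys() & judge_scores.keys()
    if not shared:
        raise ValueError("no items scored by both humans and the judge")
    hits = sum(
        abs(mean(human_scores[item]) - judge_scores[item]) <= tolerance
        for item in shared
    )
    return hits / len(shared)


def dimensions_needing_recalibration(
    human: dict[str, dict[str, list[float]]],
    judge: dict[str, dict[str, float]],
    threshold: float = 0.85,
    tolerance: float = 0.125,
) -> list[str]:
    """Dimensions where judge-human agreement has fallen below the threshold."""
    return sorted(
        dim for dim in human
        if agreement_rate(human[dim], judge[dim], tolerance) < threshold
    )


# Hypothetical labeled slice: three human ratings and one judge score per item.
human = {
    "tone": {"a": [0.75, 0.75, 1.0], "b": [0.25, 0.5, 0.25]},
    "factuality": {"a": [1.0, 1.0, 1.0], "b": [0.0, 0.0, 0.0]},
}
judge = {
    "tone": {"a": 0.75, "b": 0.75},
    "factuality": {"a": 1.0, "b": 0.0},
}
flagged = dimensions_needing_recalibration(human, judge)  # ["tone"]
```

A chance-corrected statistic such as Kappa between the judge and the human consensus is a reasonable alternative when the scale is categorical.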
Limitations
- Rater fatigue is real. Long sessions, long rubrics, and long scales all degrade signal. Keep sessions short, dimensions few, scales matched to the dimension.
- Cohort bias. A homogeneous rater pool produces ratings biased toward the cohort's defaults. Diversify when possible; document the cohort either way.
- Self-reported reliability is suspect. Raters' confidence is a poor proxy for rater accuracy. Use independent re-rating on a sample as a ground-truth check.
- Net Promoter Score and similar aggregates do not decompose. They are useful as a top-level user-experience signal but cannot be used as a per-dimension regression metric.
- Human feedback is slow. Even the best workflow runs at human throughput. Use it as calibration ground truth for automated evaluators that handle the volume; do not try to scale it to every trace.
- Drift in the rubric itself. When a rubric is updated, historical scores are no longer directly comparable. Tag every dataset version with its rubric version; recompute baselines after a rubric change.
Evidence and sources
- Wikipedia, "Cohen's Kappa," https://en.wikipedia.org/wiki/Cohen%27s_kappa, for the standard two-rater agreement metric.
- Wikipedia, "Fleiss' Kappa," https://en.wikipedia.org/wiki/Fleiss%27_kappa, for the multi-rater extension.
- Wikipedia, "Krippendorff's Alpha," https://en.wikipedia.org/wiki/Krippendorff%27s_alpha, for the variable-rater, mixed-scale alternative; see also Krippendorff, "Content Analysis: An Introduction to Its Methodology."
Numeric figures in this post (Kappa thresholds, sampling rates, weights) are illustrative; calibrate against your own workload and rater pool before using them.
FAQ
How many raters per item? Two raters give a Cohen's Kappa. Three give a Fleiss' Kappa and a tie-breaker. Most production workflows use three; high-stakes adversarial slices use five.
What scale should I default to? Five-point Likert for routine subjective grading; binary for unambiguous categories; pairwise for A/B comparisons between system versions. Avoid scales longer than seven points unless raters demonstrably use them.
How do I combine human scores with LLM-as-judge scores? Treat both as estimators of the same per-dimension rubric. Compute judge-human agreement on a labeled slice. When agreement is high, let the judge handle the bulk of the volume; route ambiguous and high-stakes items to humans.
What if my raters disagree a lot? Diagnose the rubric first, the raters second. Ambiguous anchors and overlapping dimensions are the most common cause of low Kappa. Tighten the rubric and retrain raters before changing the rater pool.
Can I use thumbs-up counts as a metric? As a signal, yes. As a deployment-gating metric, no. Thumbs-up data is biased toward engaged users, sparse, and not dimension-decomposed. Use it to surface clusters that deserve structured human review.
How often should baselines and targets be refreshed? Whenever the rubric changes, whenever the rater pool changes meaningfully, and on a fixed cadence (monthly or quarterly) regardless. Stale baselines silently produce false-positive drift alerts.