AI Evaluation for ML Engineers: Calibration, Judges as Code, and Failure-Mode Driven Test Design

By: Ari Heljakka

Short answer

For an ML engineer working on LLM-powered features, AI evaluation is a software-engineering discipline applied to probabilistic systems. The methodology has four invariants: derive the failure-mode taxonomy from production traces before choosing metrics; pick the lightest evaluator that can detect each failure class (rule, classifier, or LLM-as-judge); calibrate every judge against a human-labelled slice using Matthews correlation coefficient (MCC); and treat the entire eval suite as code, with versioned ground-truth datasets, versioned evaluator prompts, and CI gates that block bad merges. Done this way, your eval is reproducible, model-agnostic, and falsifiable, which is what separates engineering from vibes.

Key facts

  • Definition: A production-grounded eval system is a set of versioned evaluators, each scored against a human-labelled calibration set, run on demand against versioned datasets and on a sampled stream of production traces.
  • When to use: From the first prompt change after launch. Pre-deployment test sets cannot anticipate the failure modes real users surface.
  • Limitations: Judge calibration drifts with model updates. Annotation throughput is the bottleneck. MCC below 0.4 means the judge is not useful as a gate.
  • Example: An ML engineer samples 50 production traces, identifies 8 failure modes, builds rule checks for 3 and LLM judges for 5, calibrates each judge to MCC ≥ 0.6, and wires the panel into CI with per-dimension regression tolerances.

Key takeaways

  • Failure modes first, metrics second. The most common mistake is selecting evaluators before observing what breaks in production.
  • The same objective can have multiple implementations. Rule checks, classifiers, and judges are interchangeable as long as they clear the alignment gate on the same calibration set.
  • MCC is the right alignment metric. Accuracy is misleading on the imbalanced label distributions typical of production LLM data.
  • Minimum annotation sample sizes are non-negotiable for blocking gates. Smaller label sets can still bootstrap monitor-only evaluators if their uncertainty is explicit.
  • Offline and online evaluation are the same primitives applied to different data sources. The seam is the dataset registry, not the evaluator catalog.
  • Treat judges and datasets as versioned artifacts. With the system under test held fixed, a score change can come from only two places, the evaluator or the dataset; both should be inspectable from a hash.

Definition

A production-grounded eval system has five engineering properties:

  1. Failure-mode-driven. Every evaluator targets a named failure mode observed in production traces. Anonymous "quality scores" are not part of the system.
  2. Composable. Each evaluator scores a single dimension (faithful, helpful, format-correct, safe), normalised to 0 to 1. Scorecards compose dimensions; evaluators do not double-count.
  3. Calibrated. Each evaluator has a human-labelled calibration set and a measured alignment metric (MCC). Below a threshold, the evaluator is a draft, not a gate.
  4. Versioned. Evaluator prompts, classifier weights, rule definitions, and calibration datasets are all versioned. Score changes are attributable to a specific change.
  5. Operational. Evaluators run in two contexts: offline against a fixed dataset (for CI and release gates) and online against sampled production traces (for drift detection).

The objective (catch failure mode X) is independent of the implementation (which rule, classifier, or judge does the catching). The implementation is allowed to evolve; the objective and its calibration set are the durable contracts.
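One way to make that contract concrete is to fingerprint each evaluator version together with the calibration set it was scored against. The sketch below is a minimal illustration, not a prescribed schema: `EvaluatorVersion`, `content_hash`, the failure-mode name `missing_citation`, and the calibration id `cal-0001` are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def content_hash(payload: dict) -> str:
    """Short, stable SHA-256 digest of a JSON-serialisable payload."""
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


@dataclass(frozen=True)
class EvaluatorVersion:
    failure_mode: str          # the durable objective
    calibration_set_hash: str  # the durable contract it is scored against
    kind: str                  # "rule", "classifier", or "judge"
    definition: str            # rule source, weights reference, or prompt text

    @property
    def fingerprint(self) -> str:
        # Any change to the implementation or its calibration set changes this.
        return content_hash(asdict(self))


# Two implementations of the same objective, calibrated on the same set.
rule_v1 = EvaluatorVersion("missing_citation", "cal-0001", "rule", r"\[\d+\]")
judge_v1 = EvaluatorVersion(
    "missing_citation", "cal-0001", "judge",
    "Does the answer cite a source for every factual claim?",
)
print(rule_v1.fingerprint, judge_v1.fingerprint)
```

The objective and calibration set are shared fields; only `kind` and `definition` differ, so swapping the implementation yields a new fingerprint without touching the contract.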

When this matters

The discipline pays off most clearly when:

  • You are about to change a prompt, a model, or a retrieval step. Without a calibrated gate, you cannot tell whether the change improved or regressed quality on real failure modes.
  • You are switching providers or model versions. Model-agnostic objectives are the only way to make the comparison fair.
  • Your team has shipped a quality incident. Naming the failure mode and attaching a judge is the cheapest way to ensure it does not silently recur.
  • You need to defend an experiment result. A score difference without a calibration metric is anecdote; with MCC and a hash-pinned dataset, it is evidence.
  • You are about to fine-tune. A calibrated eval suite is what tells you whether the fine-tune helped, hurt, or just moved noise around.

How it works

Stage 1, Failure-mode taxonomy from production

Sample 50 to 100 production traces. Two reviewers (an ML engineer plus a domain expert) tag every observable failure. Cluster the tags into named failure modes. A useful starter set for an agentic system:

  • Tool selection error (wrong tool for the task)
  • Tool response misinterpretation (right tool, wrong reading)
  • Constraint violation (ignored a policy or schema)
  • Goal drift (answer is plausible but off-objective)
  • Hallucination (unsupported factual claim)
  • Scope violation (out-of-domain answer)
  • Context loss (forgot earlier turn or constraint)
  • Premature termination (stopped before completing the task)

Each failure mode gets a one-sentence definition, a worked example, and a severity (safety > task-completion > polish). The taxonomy is the spine of the eval system; without it, every later choice is arbitrary.
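A taxonomy entry can be kept as a small typed record so that the definition, example, and severity travel together. This is a sketch under assumptions: the class names, the worked examples, and the severity assigned to each mode are illustrative, not taken from any real trace review.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Ordering follows the text: safety > task-completion > polish.
    POLISH = 1
    TASK_COMPLETION = 2
    SAFETY = 3


@dataclass(frozen=True)
class FailureMode:
    name: str
    definition: str   # one sentence
    example: str      # one worked example from a reviewed trace
    severity: Severity


# Severity assignments and examples below are assumptions for illustration.
TAXONOMY = [
    FailureMode(
        "tool_selection_error",
        "The agent called the wrong tool for the task.",
        "Used the order-lookup tool to answer a billing question.",
        Severity.TASK_COMPLETION,
    ),
    FailureMode(
        "constraint_violation",
        "The agent ignored a stated policy or schema.",
        "Promised a refund outside the policy window.",
        Severity.SAFETY,
    ),
    FailureMode(
        "goal_drift",
        "The answer is plausible but does not address the user's objective.",
        "Explained shipping options when asked how to cancel.",
        Severity.TASK_COMPLETION,
    ),
]

# Work the highest-severity modes first when deciding where to spend labels.
by_severity = sorted(TAXONOMY, key=lambda m: m.severity, reverse=True)
for mode in by_severity:
    print(mode.severity.name, mode.name)
```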

Stage 2, Pick the lightest evaluator per failure mode

Failure characteristic | Evaluator type
Deterministic (format, tool name, schema, presence of citation) | Rule check or assertion
Pattern-matchable with labels | Small classifier (calibrated on labelled traces)
Semantic, rubric-expressible | LLM-as-judge with a written rubric
Multi-signal | Composite: rule + classifier + judge, combined at the scorecard layer

The bias is toward the cheapest evaluator that clears the alignment gate. A regex that hits MCC 0.85 on schema violations is preferable to a judge that costs 100x as much and hits 0.65.
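For the deterministic row, a rule check can be a few lines. The sketch below assumes citations look like `[1]` and that tool calls are JSON objects with required keys; both formats, and the function names, are assumptions made for illustration.

```python
import json
import re

# Assumed citation format: a bracketed number such as [1] or [12].
CITATION = re.compile(r"\[\d+\]")


def citation_present(output: str) -> float:
    """Score 1.0 if at least one citation marker appears, else 0.0."""
    return 1.0 if CITATION.search(output) else 0.0


def has_required_keys(raw: str, required: set) -> float:
    """Score 1.0 if raw parses as a JSON object containing every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and required.issubset(obj) else 0.0


print(citation_present("Refunds take 5 days [2]."))
print(has_required_keys('{"tool": "search", "args": {}}', {"tool", "args"}))
```

Both checks emit a score on the same 0 to 1 scale a judge would, so they can be calibrated against the same labelled slice and swapped in wherever they clear the alignment gate.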

Stage 3, Judge prompt design principles

When an LLM-as-judge is the right tool, the prompt is itself a versioned artifact subject to the same engineering discipline as production code. The principles that survive contact with real labels:

  • Single dimension per call. A judge that scores faithful, helpful, and safe in one call double-counts and produces inseparable signal. Decompose.
  • Low-ordinal output. Prefer a compact numeric scale (1 to 3 or 1 to 5) so partial improvement is visible; use pass/fail only for binary, unambiguous failures. Numeric scales beyond 5 are usually noise.
  • Explicit rubric. State the failure mode definition inline, with one positive and one negative example. Avoid "use your judgement."
  • Position-invariant. If the judge sees pairs, randomise order. Position bias is a known artifact.
  • Length-invariant. Strip cosmetics; do not let the judge reward verbosity.
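  • Justification before score. Ask the judge to articulate the failure before emitting the score; this tends to reduce sycophancy.
  • Locked seeds and temperature 0. Determinism is part of reproducibility. Hosted models may still vary slightly between runs at temperature 0, so measure run-to-run agreement rather than assuming it.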
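The principles above can be sketched as a prompt template plus a strict parser. Everything named here is hypothetical: the template wording, the `RUBRIC` contents, and `call_model`, which stands in for whatever client function sends a prompt to your judge model (that is also where temperature and seed would be pinned). `fake_model` exists only so the example runs without a provider.

```python
import re
from typing import Callable, Optional

# Single dimension, inline rubric, one passing and one failing example,
# justification before score, compact 1-3 scale.
JUDGE_TEMPLATE = """You grade exactly one dimension: {dimension}.
Failure mode: {definition}
Passing example: {positive}
Failing example: {negative}

Response to grade:
{response}

First, in one sentence, name the failure if there is one.
Then finish with a single line: SCORE: <1, 2, or 3>"""

SCORE_LINE = re.compile(r"SCORE:\s*([1-3])\s*$")


def parse_score(raw: str) -> Optional[float]:
    """Map the 1-3 ordinal onto 0-1; None means the judge broke format."""
    match = SCORE_LINE.search(raw.strip())
    return None if match is None else (int(match.group(1)) - 1) / 2


def judge(response: str, rubric: dict, call_model: Callable[[str], str]) -> Optional[float]:
    prompt = JUDGE_TEMPLATE.format(response=response, **rubric)
    return parse_score(call_model(prompt))


RUBRIC = {
    "dimension": "faithfulness",
    "definition": "The response states a fact not supported by the retrieved context.",
    "positive": "Per the policy document, refunds are available within 30 days.",
    "negative": "Refunds are available within 90 days. (The context says 30.)",
}


def fake_model(prompt: str) -> str:
    # Stand-in for a real judge call; returns a fixed, well-formed verdict.
    return "The 90-day window is not in the retrieved policy.\nSCORE: 1"


print(judge("You can get a refund within 90 days.", RUBRIC, fake_model))
```

Returning `None` on a malformed verdict, rather than guessing a score, keeps format failures visible as their own signal.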

Stage 4, MCC calibration against human annotations

Calibration is the gate between draft and operational. For each evaluator on each failure mode:

  1. Hold out a labelled slice (typically 20% of the labelled set).
  2. Run the evaluator against the held-out inputs.
  3. Compute Matthews correlation coefficient between evaluator output and human label.
  4. Apply the alignment threshold per severity:
    • Critical failure modes: MCC ≥ 0.7 to gate, ≥ 0.4 to monitor.
    • High severity: MCC ≥ 0.6 to gate.
    • Medium severity: MCC ≥ 0.5 to gate; lower is monitor-only.

Why MCC: production label distributions are imbalanced (most outputs are fine; the failure mode you care about may be 2% of traffic). Accuracy is misleading here: a judge that says "no failure" 100% of the time scores 98% accuracy and an MCC of zero (by the usual convention, since the formula's denominator vanishes for constant predictions). F1 is better but still asymmetric, because it ignores true negatives; MCC penalises constant predictions correctly and is symmetric in the confusion matrix.
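The 2% case can be checked by hand from the confusion matrix. The sketch below computes MCC directly and applies the per-severity gate thresholds listed above; the function names and the second, "useful" judge's counts are illustrative assumptions.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient; 0.0 by convention when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)


# Gate thresholds per severity, as listed in step 4.
GATE_THRESHOLD = {"critical": 0.7, "high": 0.6, "medium": 0.5}


def can_gate(score: float, severity: str) -> bool:
    return score >= GATE_THRESHOLD[severity]


# 1,000 traces, 2% failures, judge always answers "no failure".
constant = dict(tp=0, fp=0, fn=20, tn=980)
print(accuracy(**constant), mcc(**constant))

# Hypothetical judge that catches 18 of 20 failures with 4 false alarms.
useful = dict(tp=18, fp=4, fn=2, tn=976)
print(accuracy(**useful), mcc(**useful), can_gate(mcc(**useful), "critical"))
```

The constant judge and the useful judge are both close to 98% accurate or better, which is why accuracy alone cannot separate them.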

Stage 5, Minimum annotation sample sizes

Statistical power, not vibes, sets the floor for a blocking release gate. For a binary judge against a held-out slice:

Population pass/fail split | Minimum labelled examples per failure mode
80/20 | 100
90/10 | 200
95/5 | 500
99/1 | 2,000 (or active learning to upsample positives)

For low-frequency failure modes, oversample anomalies into the calibration set rather than uniformly sampling traffic. If you have fewer labels than this, still start: publish the evaluator as draft or monitor-only, document the uncertainty, and promote it to a gate only after the calibration slice has enough power.
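The table can be encoded as a lookup that decides whether an evaluator is eligible to gate. This is a sketch: the function names are hypothetical, and rounding an in-between rate to the next rarer row is a conservative assumption, not something the table specifies.

```python
# (minority-class rate, minimum labelled examples), straight from the table.
SAMPLE_FLOOR = [(0.20, 100), (0.10, 200), (0.05, 500), (0.01, 2000)]


def min_labelled_examples(minority_rate: float) -> int:
    """Floor for a blocking gate; in-between rates use the next rarer row."""
    for rate, floor in SAMPLE_FLOOR:
        if minority_rate >= rate:
            return floor
    # Rarer than 1%: the table's last row is the best available floor, and
    # oversampling anomalies is the better route than uniform sampling.
    return SAMPLE_FLOOR[-1][1]


def promotion_status(n_labels: int, minority_rate: float) -> str:
    if n_labels >= min_labelled_examples(minority_rate):
        return "gate-eligible"
    return "monitor-only"


# Each row works out to roughly 20-25 expected minority-class examples.
print([round(rate * floor) for rate, floor in SAMPLE_FLOOR])
print(promotion_status(150, 0.10))
```

Read this way, the table is less about total volume than about seeing enough of the rare class, which is the motivation for oversampling anomalies.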

Stage 6, Offline and online architecture

The eval system has two execution surfaces sharing one set of evaluators.

  • Offline. Each evaluator runs against a versioned dataset snapshot (immutable hash, creation timestamp, sampling provenance). CI runs the panel; release gates compare the new score to the baseline with a per-dimension regression tolerance.
  • Online. A sampled fraction of production traces is re-scored by the same evaluators. Per-failure-mode rates feed a dashboard with alerts on drift. Sampling is stratified: 100% of safety-relevant events, 10 to 30% of semantic-quality events, 5% of secondary-dimension events.

The seam between the two is the dataset registry. Online failures get routed into the annotation queue; new labels materialise into the next dataset snapshot; offline evaluators are re-calibrated against the new snapshot. The loop is what makes the suite compound.
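On the online side, stratified sampling can be made reproducible by hashing the trace id instead of drawing a random number, so the same trace always gets the same decision. The sketch below assumes three event classes and picks 20% for semantic-quality events from within the 10 to 30% range; the names and that midpoint are assumptions.

```python
import hashlib

# Rates from the text; 0.20 is an assumed point inside the 10-30% range.
SAMPLE_RATE = {"safety": 1.0, "semantic": 0.20, "secondary": 0.05}


def should_score(trace_id: str, event_class: str) -> bool:
    """Deterministically decide whether to re-score this production trace."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < SAMPLE_RATE[event_class]


ids = [f"trace-{i}" for i in range(10_000)]
for event_class in SAMPLE_RATE:
    kept = sum(should_score(t, event_class) for t in ids)
    print(event_class, kept / len(ids))
```

Because the decision depends only on the id, a trace flagged online can be re-scored later with a new evaluator version and compared like for like.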

Stage 7, Detecting genuine regressions

A score difference between two runs is not automatically a regression. Apply two filters before alerting:

  • Significance. Run a Welch's t-test (or equivalent) on per-trace scores, with the per-dimension regression tolerance declared in advance. A 0.02 drop on a 200-sample dataset is usually within noise; a 0.05 drop on a 2,000-sample dataset is much less likely to be.
  • Effect size. Cohen's d above the per-dimension threshold. Statistically significant moves below the effect-size floor are still not interesting.

Both filters live in the gate logic, not in the evaluator. The evaluator emits scores; the gate decides whether a score change is actionable.
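A minimal sketch of the two filters, using only the standard library. Two assumptions are worth flagging: the p-value uses a normal approximation to the t distribution, which is reasonable at a few hundred samples but not at small n, and the defaults `alpha=0.05` and `min_d=0.1` are placeholders for thresholds you would declare per dimension. The pass/fail score lists are synthetic.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def is_regression(baseline, candidate, alpha=0.05, min_d=0.1):
    """Actionable only if the drop is both significant and large enough."""
    t, _df = welch_t(baseline, candidate)
    p_one_sided = 1 - NormalDist().cdf(t)  # normal approximation, large n only
    return p_one_sided < alpha and cohens_d(baseline, candidate) >= min_d


# Synthetic pass/fail scores: a 0.02 drop at n=200, a 0.05 drop at n=2,000.
small_base, small_cand = [1] * 170 + [0] * 30, [1] * 166 + [0] * 34
large_base, large_cand = [1] * 1700 + [0] * 300, [1] * 1600 + [0] * 400

print(is_regression(small_base, small_cand))  # False: within noise
print(is_regression(large_base, large_cand))  # True: significant and above the floor
```

The large-sample case has a Cohen's d of roughly 0.13, so it clears a 0.1 floor but would not clear 0.2; the effect-size threshold does real work and should be chosen deliberately.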

Example

An ML engineer owns a customer-support agent that recently shipped its first prompt update post-launch. The eval system at the start of the iteration:

  • Trace sample. 80 production sessions; review surfaces 6 failure modes, of which 2 are critical.
  • Evaluator design. Rule check for citation presence (deterministic), small classifier for off-topic detours (pattern-matchable), three LLM judges.
  • Calibration. 200 labelled examples per critical failure mode, 100 per high-severity. Held-out MCC: 0.78, 0.71, 0.62, 0.58. The first three gate; the last one is monitor-only.
  • Dataset snapshot. Immutable, hash-pinned.
  • CI gate. Per-dimension regression tolerance: 0.03 drop on critical failure modes is blocking; 0.05 drop on high-severity is blocking; advisory below that.
  • Prompt change. The new prompt improves overall pass rate from 0.84 to 0.87 but regresses on one gated dimension from 0.83 to 0.76. The gate fails with a structured diff and the 10 worst-failing examples linked.
  • Iteration. The engineer narrows the new instruction, re-runs the gate, and lands the change with that dimension back to 0.84 and overall pass rate at 0.88.

The new prompt ships in half a day instead of after a week of guesswork, and the regression that would have shipped silently is caught at merge time.
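The per-dimension tolerance check in this example can be sketched in a few lines. The tolerances and the 0.83, 0.76, and 0.84 scores come from the example; the dimension names `dim_critical_a` and `dim_high_b` are hypothetical stand-ins, and the scores for `dim_high_b` are made up so the second dimension has something to compare.

```python
# Blocking tolerances from the example: drop size at which a merge is blocked.
TOLERANCE = {"critical": 0.03, "high": 0.05}


def blocking_regressions(baseline: dict, candidate: dict, severity: dict) -> list:
    """Return the dimensions whose score drop meets or exceeds the tolerance."""
    blocked = []
    for dim, base in baseline.items():
        drop = round(base - candidate[dim], 6)  # rounding avoids float edge cases
        if drop >= TOLERANCE[severity[dim]]:
            blocked.append(dim)
    return blocked


SEVERITY = {"dim_critical_a": "critical", "dim_high_b": "high"}
BASELINE = {"dim_critical_a": 0.83, "dim_high_b": 0.80}
FIRST_TRY = {"dim_critical_a": 0.76, "dim_high_b": 0.82}   # regresses by 0.07
SECOND_TRY = {"dim_critical_a": 0.84, "dim_high_b": 0.82}  # after narrowing

print(blocking_regressions(BASELINE, FIRST_TRY, SEVERITY))   # ['dim_critical_a']
print(blocking_regressions(BASELINE, SECOND_TRY, SEVERITY))  # []
```

A rising overall pass rate never enters the check, which is how a net-positive prompt change can still be blocked on one dimension.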

Limitations

  • Judge calibration drifts. A new checkpoint from the judge provider can shift alignment overnight. Re-measure MCC on every judge-model change; downgrade gates to advisory until re-calibrated.
  • Annotation throughput caps the suite. No amount of judge engineering substitutes for labelled examples. Plan for a sustained weekly annotation cadence per surface.
  • MCC has known weaknesses on tiny samples. Below ~50 labelled examples, MCC is noisy; treat early MCCs as directional, not gating, and grow the calibration set.
  • Rule checks scale linearly; judges scale by token cost. Sampling, caching, and rate-limiting are real engineering work; budget for them.
  • A single composite score is a lie. Resist the executive ask for "the eval score." Per-dimension scorecards expose tradeoffs; a single number hides them.
  • Eval suites can rot. A failure mode that has not recurred in 90 days is a candidate for archival; CI compute is finite.
  • Synthetic adversarials cannot replace production labels. They help bootstrap pre-launch and shore up under-represented slices; they do not substitute for real distributions.
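To see how noisy MCC is on a small calibration set, a percentile bootstrap gives a rough interval. This is a sketch, not a recommended procedure: the judge's hit and false-alarm counts are invented, and `n_boot`, `level`, and `seed` are arbitrary defaults.

```python
import math
import random


def mcc(pairs):
    """MCC over (predicted_failure, labelled_failure) boolean pairs."""
    tp = sum(1 for p, y in pairs if p and y)
    tn = sum(1 for p, y in pairs if not p and not y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bootstrap_interval(pairs, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for MCC, resampling labelled examples."""
    rng = random.Random(seed)
    stats = sorted(mcc(rng.choices(pairs, k=len(pairs))) for _ in range(n_boot))
    lo = stats[int(n_boot * (1 - level) / 2)]
    hi = stats[int(n_boot * (1 + level) / 2) - 1]
    return lo, hi


# Invented judge: catches 3 of 4 failures, with 1 false alarm in 36 passes.
small = [(True, True)] * 3 + [(False, True)] + [(True, False)] + [(False, False)] * 35
large = small * 10  # same rates, ten times the labels

print(mcc(small), bootstrap_interval(small))
print(mcc(large), bootstrap_interval(large))
```

Both sets have the same point estimate, so comparing the two intervals isolates the effect of sample size; reporting the interval alongside the MCC is one way to make the uncertainty of a draft evaluator explicit.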

FAQ

Why MCC instead of F1 or accuracy? Accuracy is misleading on imbalanced production data; a degenerate "always pass" judge scores high. F1 is asymmetric across positive and negative classes. MCC uses every cell of the confusion matrix symmetrically and penalises constant predictions, which is exactly the failure mode you most need to catch.

How small can the calibration set be? Statistical power, not preference, sets the minimum. 100 labelled examples for an 80/20 split is a usable floor; below 50, MCC is unstable. For rare failure modes (under 1% of traffic), oversample anomalies into the calibration set rather than draw uniformly.

What is the right sampling rate for online evaluation? Stratified, not uniform. A common pattern: 100% of safety-relevant events, 10 to 30% of semantic-quality events, 5% of secondary-dimension events, and 100% of anomaly-flagged sessions. Tune per failure mode based on cost and signal value.

Can I use the same judge for offline and online? Yes; that is the design. The seam between offline and online is the dataset, not the evaluator. Sharing evaluators keeps the calibration relevant to both surfaces.

What if my judge regresses when the provider ships a new model? Treat the model swap like any other infrastructure change: re-run calibration, recompute MCC, downgrade affected gates from blocking to advisory if alignment drops below threshold, and ship the re-calibrated judge as a new evaluator version. Model agnosticism in the objective layer is what makes this swap tractable.

Is LLM-as-judge always the right tool? No. Use it only when the failure mode is semantic and rubric-expressible. For deterministic problems (schema, tool name, citation presence), use a rule check; it is cheaper, faster, and easier to calibrate. The judge is the last resort, not the default.

How do I prevent judges from rewarding longer responses? Strip cosmetics in the prompt, evaluate against a length-balanced calibration slice, and check the score-length correlation as a calibration metric in its own right. A judge whose scores correlate at 0.4 or more with output length is probably rewarding verbosity.
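The score-length check is a plain Pearson correlation between output length and judge score. The sketch below uses the 0.4 level mentioned above as the flag threshold; the function names and the sample lengths and scores are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rewards_length(lengths, scores, threshold=0.4):
    """Flag a judge whose scores rise with output length."""
    return pearson(lengths, scores) >= threshold


# Made-up calibration slices: output lengths in tokens and judge scores.
biased = ([50, 120, 300, 800], [0.0, 0.5, 0.5, 1.0])
neutral = ([100, 200, 300, 400], [0.5, 1.0, 0.0, 0.5])

print(rewards_length(*biased))   # True
print(rewards_length(*neutral))  # False
```

A high correlation is only a warning sign on its own, since longer answers can be genuinely better; that is why the check belongs alongside a length-balanced calibration slice.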
