Updated: 2026-01-22 By: Ari Heljakka
Short answer
Generic evaluators ("is this toxic? is this hallucinated? is the answer long enough?") catch generic failures. Real production failures are usually product-specific: a pricing agent that returns correct prices for the wrong SKU, a support agent that answers a different question than the one asked, a planning agent that picks the wrong tool with technically valid arguments. Annotation queues bridge the gap. They sample production traces, route them to human reviewers, cluster the resulting labels into named failure modes, and feed those labels back as the calibration set for a specific evaluator that monitors that failure mode on every future trace. The annotation queue is the monitoring system; the evaluator is the diagnosis.
Key facts
- Definition: An annotation queue is a structured backlog of production traces that human reviewers label, comment on, and cluster into failure modes. Each labeled cluster becomes a calibration set for a managed evaluator targeted at that specific failure.
- When to use: Whenever generic evals are passing but users keep complaining, or whenever you launch a new agent surface and need to discover its actual failure modes before you can monitor them.
- Limitations: Annotation queues need real, sampled traffic; they do not work on synthetic data alone. They also need disciplined review hygiene: ambiguous labels poison the resulting evaluator just as ambiguous training labels poison a model.
- Example: A pricing agent passes a generic "answer matches a price in the catalog" evaluator at 0.97 but users keep complaining. Annotation queues surface a cluster of "right price, wrong variant" traces. The team writes a dedicated SKU-match evaluator calibrated on those labeled examples, and it begins running on every production trace.
Key takeaways
- Failure modes are product-specific. Generic evaluators are necessary but not sufficient; they do not know what your users care about.
- The route from "we have a problem we cannot name" to "we have an evaluator that catches it" runs through human annotation of real traces.
- Annotation queues, evaluators, and rubrics are all versioned artifacts. So is the ground truth dataset they produce. Treat them as code, not as one-off spreadsheets.
- The evaluator that comes out of an annotation queue should be calibrated against the human labels and re-calibrated whenever the underlying model or prompt changes.
- Monitoring and diagnosis are different jobs. Generic eval sampling and annotation queues monitor; specific evaluators built from that data diagnose.
Definition
An annotation queue is a managed queue of production traces (full request and response pairs, often with intermediate tool calls and reasoning steps) that human reviewers inspect, label, and comment on. Labels are typically structured: a pass or fail flag, a failure category, a free-text reason, and sometimes a corrected reference output. Reviewers might be in-house engineers, domain experts, or contractors.
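The structured label described above can be sketched as a small record type. This is a minimal illustration, assuming Python dataclasses; the class name `Annotation`, its field names, and the rule that a failing label must carry a reason are all hypothetical choices, not the schema of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One reviewer's label for one production trace (hypothetical schema)."""
    trace_id: str
    reviewer: str
    passed: bool                             # pass or fail flag
    failure_category: Optional[str] = None   # failure category, if any
    reason: str = ""                         # free-text reason
    corrected_output: Optional[str] = None   # optional corrected reference output

    def __post_init__(self) -> None:
        # Assumption: a failing label without a reason cannot be clustered
        # later, so it is rejected at creation time.
        if not self.passed and not self.reason.strip():
            raise ValueError("failing annotations need a free-text reason")


label = Annotation(
    trace_id="trace-001",
    reviewer="reviewer-a",
    passed=False,
    failure_category="sku_mismatch",
    reason="agent picked the base variant when user described the pro variant",
)
```

Keeping the record immutable makes it easier to treat labels as versioned artifacts: a correction becomes a new record rather than an in-place edit.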
A bridge from annotation queues to evaluators is the workflow that takes the labeled traces and turns them into calibration data for a new managed evaluator. Each cluster of labels with a shared failure pattern becomes the seed for one specific evaluator. The evaluator's objective is the cluster's name; the evaluator's implementation (LLM judge, learned classifier, deterministic rule) is chosen separately and can change without changing the objective.
Together these form a closed loop:
- Sample production traces.
- Human reviewers label them.
- Cluster labels into named failure modes.
- Build a specific evaluator for each cluster, calibrated on the labeled traces.
- Run the new evaluator continuously on production; surface its alerts in the same annotation queue, ready for the next round of labeling.
When this matters
Three situations push teams to invest in annotation queues:
- Generic evaluators are green and users are red. This is the clearest signal that your evaluation suite is missing product-specific failure modes. Generic dimensions cover universal concerns (safety, faithfulness, refusal correctness); they say nothing about whether you served the right SKU.
- New agent surfaces with unknown failure modes. When you ship a new tool-using agent, you do not yet know how it will fail. You cannot write an evaluator for a failure mode you have not seen. Annotation queues are the discovery mechanism.
- Drift after a model swap or prompt change. A change that passes generic evaluators can still introduce a new product-specific failure mode that nobody anticipated. Annotation queues catch the long tail.
How it works
Stage 1: Sample production traffic
Capture full traces (input, intermediate steps, output, latency, cost) and sample 5 to 10 percent into the annotation queue. Sampling can be uniform or biased: route traces where generic evaluators were uncertain, where user feedback was negative, or where the agent took an unusual tool path. Biased sampling discovers failure modes faster; uniform sampling avoids missing the silent ones.
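One way to combine biased and uniform sampling is to always route suspicious traces and apply a uniform floor to everything else. The sketch below assumes traces are plain dictionaries; the field names (`generic_eval_score`, `user_feedback`, `unusual_tool_path`) and the 0.4 to 0.6 uncertainty band are illustrative assumptions.

```python
import random
from typing import Optional


def should_sample(trace: dict, rng: random.Random, base_rate: float = 0.05) -> bool:
    """Decide whether a trace enters the annotation queue.

    Suspicious traces are always routed; all others are sampled uniformly
    at `base_rate`. Field names and the uncertainty band are assumptions.
    """
    score: Optional[float] = trace.get("generic_eval_score")
    uncertain = score is not None and 0.4 <= score <= 0.6
    negative_feedback = trace.get("user_feedback") == "negative"
    unusual_path = bool(trace.get("unusual_tool_path", False))
    if uncertain or negative_feedback or unusual_path:
        return True
    # Uniform floor so that silent failures still have a chance of review.
    return rng.random() < base_rate


rng = random.Random(0)  # seeded only to make the example repeatable
traces = [{"id": i, "generic_eval_score": 0.95} for i in range(10_000)]
sampled = sum(should_sample(t, rng) for t in traces)
```

With a `base_rate` of 0.05, roughly 5 percent of the unremarkable traces above end up in the queue, while any trace with negative feedback is routed regardless of the rate.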
Stage 2: Human review and labeling
Reviewers see each trace with a structured form: a pass or fail flag, a free-text reason, optional category tags, and optional corrected output. Two practices keep label quality high:
- Calibration sessions. New reviewers label the same warm-up set as senior reviewers. Inter-rater agreement is the first quality gate; a reviewer whose agreement with the consensus falls below an explicit threshold stays in training.
- Disagreement routing. Traces with reviewer disagreement go to a senior reviewer for adjudication. The adjudicated label, plus the reasons given by each reviewer, becomes a high-value calibration example.
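Both practices can be expressed in a few lines. The sketch below uses Cohen's kappa as the agreement measure, which is one common choice rather than something the workflow prescribes, and assumes binary pass or fail labels from two reviewers on the same warm-up set. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two reviewers on the same traces."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each reviewer's own pass/fail rates.
    expected = sum(count_a[k] * count_b[k] for k in (True, False)) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def needs_adjudication(a: Sequence[bool], b: Sequence[bool]) -> List[int]:
    """Indices of traces where the two reviewers disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


senior = [True, True, False, False, True, False, True, True, False, True]
trainee = [True, True, False, True, True, False, True, False, False, True]
print(round(cohens_kappa(senior, trainee), 3))  # 0.583
print(needs_adjudication(senior, trainee))      # [3, 7]
```

Raw agreement here is 0.8, but kappa is lower because two reviewers who each pass 60 percent of traces would already agree about half the time by chance.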
Stage 3: Cluster into failure modes
Labels accumulate. Periodically (weekly is typical for active products), cluster the failed traces by:
- Free-text reason similarity (embedding-based).
- Common substrings in the user input or model output.
- Common tool-call patterns.
- Common surface area (e.g. all checkout-related queries).
Each meaningful cluster becomes a named failure mode: "wrong SKU for correct product family," "refused a benign question," "answered a different question than asked." Naming matters; an unnamed cluster is not a monitorable failure mode.
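The clustering step can be sketched without any external dependency. Note the substitution: the text above describes embedding-based similarity, while this example uses token-set Jaccard overlap as a stand-in so that it runs on the standard library alone. The greedy single-pass strategy and the 0.5 threshold are also assumptions; a production pipeline would more likely embed the reasons and use a proper clustering algorithm.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set for a free-text reason."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_reasons(reasons: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters: List[List[int]] = []
    seeds: List[Set[str]] = []
    for i, reason in enumerate(reasons):
        toks = tokens(reason)
        for cluster, seed in zip(clusters, seeds):
            if jaccard(toks, seed) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so this reason seeds a new one.
            clusters.append([i])
            seeds.append(toks)
    return clusters


reasons = [
    "agent picked the base variant when user described the pro variant",
    "agent picked the base variant but user described the pro variant",
    "refused a benign question",
    "agent refused a benign question",
]
print(cluster_reasons(reasons))  # [[0, 1], [2, 3]]
```

The output is a list of index groups; naming each group remains a human step.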
Stage 4: Build a specific evaluator per cluster
For each named cluster, the labeled traces become the calibration dataset for a new managed evaluator. The evaluator's:
- Objective is the cluster's name (e.g. "SKU match accuracy"), expressed independently of how it will be measured.
- Implementation is chosen for the failure mode. Some clusters are best handled by a deterministic rule, some by a learned classifier, some by an LLM judge prompted against a written rubric.
- Calibration is measured as agreement with the human labels. The evaluator must reach an agreement threshold on a held-out slice of the labeled traces before it goes live.
- Output is a 0 to 1 score on each future trace, plus a written justification when the implementation is an LLM judge.
Each evaluator is one dimension of the overall quality scorecard. Composition is additive: a product can have ten specific evaluators running, each scoring its own dimension, none overlapping.
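The calibration requirement, agreement with human labels on a held-out slice, can be sketched as follows. The evaluator is treated as any callable that maps a trace to a 0 to 1 score, so the same check applies to a rule, a classifier, or an LLM judge. The trace fields, the 0.5 score cutoff, and the 0.9 agreement threshold are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

LabeledTrace = Tuple[dict, bool]  # (trace, human label: True means pass)


def split_holdout(data: Sequence[LabeledTrace], holdout_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[LabeledTrace], List[LabeledTrace]]:
    """Shuffle deterministically and reserve a slice the evaluator is never tuned on."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def agreement(evaluator: Callable[[dict], float],
              labeled: Sequence[LabeledTrace], cutoff: float = 0.5) -> float:
    """Fraction of traces where the thresholded score matches the human label."""
    if not labeled:
        raise ValueError("cannot measure agreement on an empty set")
    hits = sum((evaluator(trace) >= cutoff) == human_pass
               for trace, human_pass in labeled)
    return hits / len(labeled)


def ready_to_go_live(evaluator: Callable[[dict], float],
                     holdout: Sequence[LabeledTrace],
                     min_agreement: float = 0.9) -> bool:
    return agreement(evaluator, holdout) >= min_agreement


def sku_rule(trace: dict) -> float:
    """Deterministic implementation: 1.0 when the quoted SKU is the described one."""
    return 1.0 if trace["quoted_sku"] == trace["described_sku"] else 0.0


data: List[LabeledTrace] = (
    [({"quoted_sku": "A", "described_sku": "A"}, True)] * 10
    + [({"quoted_sku": "A", "described_sku": "B"}, False)] * 10
)
train, holdout = split_holdout(data)
print(ready_to_go_live(sku_rule, holdout))
```

Because the gate only sees scores, the implementation can be swapped later without changing the objective or the calibration check.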
Stage 5: Close the loop
The specific evaluator now runs on every production trace. When it fires, the trace returns to the annotation queue with the evaluator's verdict attached. Human review confirms or contradicts. Confirmations grow the calibration set; contradictions trigger re-calibration. Over time:
- The evaluator's agreement with humans climbs as the calibration set grows.
- New failure modes emerge from the residual cluster (everything the existing evaluators did not catch).
- Old failure modes shrink, and their evaluators move from "alert me" to "block in CI."
This loop is the same loop used in eval-driven AI observability for any other production model behavior. The difference is the entry point: annotation queues provide the labels that let you start the loop in a product where you do not yet know what to measure.
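The confirm-or-contradict step can be sketched as a small piece of bookkeeping. The class name, the contradiction-rate trigger, and its default thresholds are assumptions made for illustration; the text above says only that contradictions trigger re-calibration, not how many are needed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EvaluatorState:
    """Tracks human reviews of traces that an evaluator flagged (hypothetical)."""
    calibration_set: List[Tuple[str, bool]] = field(default_factory=list)
    reviews: int = 0
    contradictions: int = 0

    def record_review(self, trace_id: str, evaluator_failed: bool,
                      human_failed: bool) -> None:
        """Store the human verdict and count whether it contradicted the evaluator."""
        self.reviews += 1
        # Every reviewed trace joins the calibration set with the human label.
        self.calibration_set.append((trace_id, human_failed))
        if evaluator_failed != human_failed:
            self.contradictions += 1

    def needs_recalibration(self, max_contradiction_rate: float = 0.1,
                            min_reviews: int = 20) -> bool:
        # Assumption: wait for a minimum number of reviews before acting on the rate.
        if self.reviews < min_reviews:
            return False
        return self.contradictions / self.reviews > max_contradiction_rate


state = EvaluatorState()
for i in range(20):
    # The evaluator flagged all 20 traces; reviewers overturned the first 3.
    state.record_review(f"trace-{i}", evaluator_failed=True, human_failed=(i >= 3))
print(state.needs_recalibration())  # True
```

Three contradictions in twenty reviews is a rate of 0.15, which exceeds the assumed 0.1 limit and so requests re-calibration.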
Example
A pricing agent for a hardware retailer:
- The generic evaluators. "Answer references a price that exists in the catalog" (score: 0.97). "Answer is on topic" (score: 0.95). "Answer is not toxic" (score: 1.00). All green.
- The complaint pattern. Customer support reports a steady trickle of "the bot quoted me $129 but the part it described is $189." Generic evaluators cannot see this; they only check that the quoted price is somewhere in the catalog.
- Annotation queue. Five percent of pricing traces are routed to two domain reviewers (product specialists). They label each trace with a SKU-match flag, a free-text reason, and the corrected SKU when applicable.
- Cluster. After two weeks, the reviewers have flagged 73 traces as SKU-mismatch failures. The free-text reasons cluster around "agent picked the base variant when user described the pro variant" and "agent picked the latest version when user described the older version."
- Specific evaluator. A new evaluator is defined with the objective "SKU match accuracy: the SKU referenced in the agent's response matches the variant the user described." The implementation is an LLM judge with a structured rubric and access to the catalog. Calibration against the 73 labeled failures (plus 200 traces labeled as passing) reaches 0.91 agreement.
- In production. The SKU-match evaluator runs on every trace, alongside the three generic evaluators. Its 0 to 1 score sits in the dashboard. When it dips, a versioned alert routes the offending trace back to the annotation queue.
- CI gate. Once the evaluator's agreement with humans stabilizes above 0.90 for a month, it becomes a deployment gate: prompt changes that regress SKU match accuracy below threshold block release.
The product now monitors a failure mode that did not exist as a measurable concept before the annotation queue surfaced it.
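The CI gate in the last step can be sketched as a check over evaluator scores from a regression run. This is a minimal sketch: the function name and the 0.95 release threshold are hypothetical, since the example above gives the agreement target (0.90) but not the accuracy threshold used for blocking.

```python
from statistics import mean
from typing import Sequence


def gate_release(candidate_scores: Sequence[float], threshold: float = 0.95) -> bool:
    """Return True when the candidate's mean SKU-match score clears the threshold."""
    if not candidate_scores:
        raise ValueError("no scores: run the evaluator on the regression set first")
    return mean(candidate_scores) >= threshold


# Scores the SKU-match evaluator produced for a candidate prompt (made-up values).
candidate = [1.0, 1.0, 0.0, 1.0]
if not gate_release(candidate):
    print("SKU match accuracy below threshold: blocking release")
```

In a CI job the same check would typically end with a non-zero exit code rather than a print.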
Limitations
- Annotation latency. Human review takes time. The queue cannot produce a new evaluator faster than one calibration cycle (typically days to weeks). For incidents that need an immediate response, the queue is not the right tool; a quick rule or filter is.
- Label noise. Two reviewers can disagree on borderline traces. Without calibration sessions and adjudication, the resulting evaluator inherits the noise.
- Coverage holes. Sampling at 5 to 10 percent misses rare failure modes. Biasing the sample toward suspicious traces helps, but a sufficiently rare failure may never reach a reviewer until it becomes common.
- Reviewer cost. Domain expert review is the most expensive part of the loop. The economic case rests on the cost of the failures the evaluator will prevent, not on the abstract value of "more labels."
- The queue is not a substitute for generic evaluators. Generic safety, faithfulness, and refusal evaluators still run; they catch universal concerns. The annotation queue covers the product-specific layer on top.
Evidence and sources
- "AI Engineering" by Chip Huyen, 2024, https://huyenchip.com/books/, for the design of annotation workflows and inter-rater agreement.
- "Designing Data-Intensive Applications" by Martin Kleppmann, https://dataintensive.net/, for the general pattern of queues as decoupling components between asynchronous producers and consumers.
- The Evals open-source repository, https://github.com/openai/evals, for one widely studied implementation of the structured evaluator pattern that calibration loops feed into.
FAQ
Why not just rely on generic evaluators? Generic evaluators measure universal concerns (safety, faithfulness, length). They are necessary baseline coverage, but they cannot know your product's specific failure modes. The annotation queue is how you discover and then monitor those failure modes.
Who should staff the annotation queue? Domain experts for product-specific dimensions; engineering staff for technical correctness; safety reviewers for policy. The right blend depends on the failure modes you are chasing. What matters is calibration: every reviewer's labels must agree with the consensus above an explicit threshold before they count.
How big does the calibration set need to be? Enough to reach your agreement target on a held-out slice. For narrow rubrics, 50 to 200 labeled examples often suffice. For broad rubrics, low hundreds to low thousands. Track agreement as the set grows; stop growing when agreement plateaus.
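A plateau check can be as simple as comparing the latest agreement measurement with the one a few batches earlier. The window of three measurements and the 0.01 minimum gain below are assumptions; choose values that match how often you re-measure.

```python
from typing import Sequence


def has_plateaued(agreement_history: Sequence[float], window: int = 3,
                  min_gain: float = 0.01) -> bool:
    """True when agreement rose by less than `min_gain` over the last `window` steps."""
    if len(agreement_history) <= window:
        return False  # not enough measurements to judge yet
    return agreement_history[-1] - agreement_history[-1 - window] < min_gain


# Held-out agreement after each labeling batch (made-up values).
history = [0.70, 0.80, 0.86, 0.90, 0.905, 0.904, 0.906]
print(has_plateaued(history))      # True
print(has_plateaued(history[:4]))  # False
```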
What if reviewers disagree with the evaluator? That disagreement is the signal. Every confirmed disagreement either improves the evaluator (added to the calibration set, used to re-calibrate) or improves the rubric (the evaluator's stated objective was ambiguous and needs refinement). Both are first-class outputs of the loop.
How does this differ from a normal evaluator pipeline? A normal evaluator pipeline assumes you already know what to measure. Annotation queues are how you discover what to measure when you do not. They sit upstream of the evaluator pipeline, feeding it product-specific calibration data.