Updated: 2026-03-24 By: Ari Heljakka
Short answer
Agents that look healthy in demos and fail in production almost always fail at the context layer, not at the model layer. Upgrading the model rarely helps; the system needs the right facts in the right place at the right time. Context evaluation is the discipline of measuring three things independently: retrieval quality (did the system pull the right context), window composition (was the retrieved context placed where the model could attend to it), and context utilization (did the model's response actually use the supplied context faithfully). All three are scored 0 to 1; all three are orthogonal; none of them is the model's job.
Key facts
- Definition: Context evaluation is the scoring of an agent's context layer (retrieval, window composition, and utilization) as a set of orthogonal dimensions, independently from the model's reasoning quality.
- When to use: Any production agent that depends on retrieved data, structured tool responses, or session memory to answer correctly. In practice, that covers most production agents.
- Limitations: Requires a versioned ground-truth dataset that includes the expected context, not just the expected answer; calibration effort per dimension; not a substitute for end-to-end evaluation.
- Example: A supply-chain agent's task-success rate stays flat at 0.71 despite two model upgrades; context-utilization scoring shows the agent ignored a large share of the context it was given; the fix is in the retrieval, the prompt, and the window composition, not the model.
Key takeaways
- The model is rarely the bottleneck in production agents. Context is.
- Three orthogonal dimensions: retrieval quality, window composition, utilization. Score them separately.
- "The model answered the question it wished you had asked, not the one you asked" is a context-utilization failure, not a model failure.
- Retrieval is separable from reasoning. Architect for the separation; evaluate at the seam.
- Context evaluation is what makes model swaps measurable. Without it, an upgrade is a vibe-driven change.
Definition
Context in a production agent is the union of: retrieved documents or rows, structured tool responses, conversation memory, system instructions, and any other state the model needs to produce a correct answer. Context is supplied to the model; it is not generated by the model.
Context evaluation is the disciplined scoring of the context layer on three orthogonal dimensions:
- Retrieval quality. Did the system pull the right things from the corpus?
- Window composition. Was the retrieved context placed and ordered in a way the model could attend to?
- Context utilization. Did the model's response use the supplied context faithfully, and use the right parts of it?
The three dimensions are measured separately so that a failure in one is not hidden by a success in another.
When this matters
Context evaluation matters most when at least two of these hold:
- The agent depends on retrieved documents, structured tool responses, or external data for correctness.
- End-to-end task success has plateaued and model upgrades produce diminishing returns.
- The agent occasionally produces confident-sounding answers that do not match the supplied data.
- The user-facing failures cluster around "the agent answered something related but not the question I asked".
- The team is debating whether to upgrade the model when the underlying gap is somewhere else.
If the agent does not consume retrieved or supplied context (a pure-generation use case with no grounding), context evaluation does not apply.
How it works
The discipline has three dimensions, one architecture pattern, and one feedback loop.
Dimension 1, retrieval quality
The first place an agent can fail. Retrieval quality scores whether the system pulled the right things from the corpus, independently of what the model did with them.
Sub-dimensions, each scored 0 to 1:
- Recall. Of the documents that contain the answer, what fraction did the retrieval pull?
- Precision. Of the documents pulled, what fraction were relevant?
- MRR or NDCG. Was the most relevant document ranked highly?
- Coverage. Across the dataset, what fraction of queries had at least one relevant document retrieved?
The signal lives in the gap between sub-dimensions. A retrieval system with 0.9 precision and 0.4 recall is leaving facts on the table; for the queries it misses, the agent has no way to answer correctly because the data never arrived.
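As a minimal sketch of the four retrieval sub-dimensions, assuming each ground-truth case stores the ranked ids the system retrieved and the set of ids labeled relevant: the `RetrievalCase` type and the function names are illustrative, not part of any particular library, and MRR is used here in place of NDCG for brevity.

```python
from dataclasses import dataclass


@dataclass
class RetrievalCase:
    """One ground-truth case (hypothetical schema)."""
    retrieved: list[str]  # ids returned by the retriever, best first
    relevant: set[str]    # ids labeled as expected context


def recall(case: RetrievalCase) -> float:
    if not case.relevant:
        return 1.0  # assumption: nothing to find counts as satisfied
    return len(case.relevant.intersection(case.retrieved)) / len(case.relevant)


def precision(case: RetrievalCase) -> float:
    if not case.retrieved:
        return 0.0
    hits = sum(1 for doc_id in case.retrieved if doc_id in case.relevant)
    return hits / len(case.retrieved)


def reciprocal_rank(case: RetrievalCase) -> float:
    # 1 / rank of the first relevant document, 0.0 if none was retrieved.
    for rank, doc_id in enumerate(case.retrieved, start=1):
        if doc_id in case.relevant:
            return 1.0 / rank
    return 0.0


def score_retrieval(cases: list[RetrievalCase]) -> dict[str, float]:
    """Average each sub-dimension over the dataset; all values fall in 0..1."""
    n = len(cases)
    return {
        "recall": sum(recall(c) for c in cases) / n,
        "precision": sum(precision(c) for c in cases) / n,
        "mrr": sum(reciprocal_rank(c) for c in cases) / n,
        # Coverage: share of queries with at least one relevant hit.
        "coverage": sum(1 for c in cases if reciprocal_rank(c) > 0) / n,
    }


cases = [
    RetrievalCase(retrieved=["a", "b", "c", "d"], relevant={"a", "x"}),
    RetrievalCase(retrieved=["p", "q"], relevant={"q"}),
]
scores = score_retrieval(cases)
```

None of these functions calls the model, which is the point: the retrieval score can move without the reasoning step being involved at all.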
Dimension 2, window composition
The second place an agent can fail. The right context arranged badly is worse than less context arranged well. Window composition scores the structure of the prompt that finally arrives at the model.
Sub-dimensions:
- Placement. Are the most relevant chunks placed where the model attends most strongly (often the start and the end of the window, not the middle)?
- Density. Is the window dense with relevant context, or padded with marginally related text that dilutes attention?
- Order. Are dependent chunks ordered so the model reads them in a coherent sequence?
- Saturation. Is the window approaching the model's effective attention limit, where added context starts to crowd out the original instructions?
Window composition is a prompt-engineering discipline that is invisible to a model-centric evaluation framework. Scoring it surfaces a class of failures that retrieval-quality and end-to-end metrics both miss.
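One way to turn placement, density, and saturation into numbers is sketched below. The linear edge-proximity formula, the `Chunk` type, and the `effective_limit_tokens` parameter are all assumptions made for illustration; the right placement curve and the effective limit depend on the model family and have to be measured. Order is left out because it needs dependency labels between chunks.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One chunk in the composed window (hypothetical schema)."""
    tokens: int
    relevant: bool


def placement(window: list[Chunk]) -> float:
    """Mean edge proximity of relevant chunks: 1.0 at either end, 0.0 at the centre.

    Assumes attention is strongest at the start and end of the window.
    """
    n = len(window)
    positions = [i for i, chunk in enumerate(window) if chunk.relevant]
    if not positions:
        return 0.0
    if n == 1:
        return 1.0
    return sum(abs(2 * i / (n - 1) - 1) for i in positions) / len(positions)


def density(window: list[Chunk]) -> float:
    """Share of window tokens that belong to relevant chunks."""
    total = sum(chunk.tokens for chunk in window)
    if total == 0:
        return 0.0
    return sum(chunk.tokens for chunk in window if chunk.relevant) / total


def saturation(window: list[Chunk], effective_limit_tokens: int) -> float:
    """Window size as a fraction of an assumed effective attention limit."""
    if effective_limit_tokens <= 0:
        raise ValueError("effective_limit_tokens must be positive")
    return sum(chunk.tokens for chunk in window) / effective_limit_tokens


filler = Chunk(tokens=100, relevant=False)
key = Chunk(tokens=200, relevant=True)
buried = [filler, filler, key, filler, filler]   # relevant chunk in the middle
reordered = [filler, filler, filler, filler, key]  # relevant chunk last
```

With these toy windows, `placement(buried)` is 0.0 and `placement(reordered)` is 1.0 while density and saturation are unchanged, which shows that a composition fix can be invisible to any metric that only counts what was retrieved.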
Dimension 3, context utilization
The third place an agent can fail. Even with right retrieval and right composition, the model can ignore supplied context and answer from its training instead.
Sub-dimensions:
- Faithfulness. Are the claims in the response grounded in the supplied context?
- Coverage of supplied context. Did the response use the right parts of the supplied context, or skip the parts that actually matter?
- Attribution accuracy. If the model cites the source, does the citation match the supplied document?
- Refusal under insufficient context. When the supplied context does not support an answer, does the model decline rather than fabricate?
Context utilization is the dimension most prone to "the model answered the question it wished you had asked". A confident, fluent wrong answer scores high on surface text quality and low on faithfulness; the gap is the signal.
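The sketch below aggregates three of the utilization sub-dimensions once per-claim labels exist. It assumes a human or a calibrated judge has already marked each claim as supported or not; the `ScoredResponse` schema is hypothetical, and attribution is simplified to "the cited id is among the supplied ids" rather than a check that the cited passage supports the claim.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScoredResponse:
    """One labeled agent response (hypothetical schema)."""
    claim_supported: list[bool]  # one flag per claim, from a human or judge
    cited_ids: list[str] = field(default_factory=list)
    supplied_ids: set[str] = field(default_factory=set)
    context_sufficient: bool = True  # ground-truth label for the case
    declined: bool = False           # model declined or signaled uncertainty


def faithfulness(r: ScoredResponse) -> Optional[float]:
    """Share of claims grounded in supplied context; None if no claims were made."""
    if not r.claim_supported:
        return None
    return sum(r.claim_supported) / len(r.claim_supported)


def attribution_accuracy(r: ScoredResponse) -> Optional[float]:
    """Share of citations that point at a document that was actually supplied."""
    if not r.cited_ids:
        return None
    return sum(1 for c in r.cited_ids if c in r.supplied_ids) / len(r.cited_ids)


def refusal_under_insufficient_context(
    responses: list[ScoredResponse],
) -> Optional[float]:
    """Among cases whose context cannot support an answer, the share declined."""
    insufficient = [r for r in responses if not r.context_sufficient]
    if not insufficient:
        return None
    return sum(1 for r in insufficient if r.declined) / len(insufficient)


answered = ScoredResponse(
    claim_supported=[True, True, False, True],
    cited_ids=["doc-1", "doc-9"],
    supplied_ids={"doc-1", "doc-2"},
)
batch = [
    answered,
    ScoredResponse(claim_supported=[], context_sufficient=False, declined=True),
    ScoredResponse(claim_supported=[False], context_sufficient=False),
    ScoredResponse(claim_supported=[False, True], context_sufficient=False),
]
```

Returning `None` rather than 0 or 1 for cases with nothing to score is a design choice: it keeps empty cases out of the averages instead of silently inflating or deflating them.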
Architecture: separate retrieval from reasoning
A reliable agent architecturally separates the retrieval step from the reasoning step. Retrieval pulls structured payloads; reasoning consumes them. The seam between the two is where context evaluation lives.
The separation has two consequences:
- Retrieval is measurable independently. Recall, precision, and MRR are scored without involving the model.
- The model is evaluable as a consumer of context, not as a source of fact. Faithfulness scoring becomes meaningful.
A monolithic agent in which retrieval and reasoning are entangled cannot be evaluated this way. Refactoring for separation is often the first concrete engineering action a team takes after context-evaluation results surface a gap.
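A minimal sketch of the seam, assuming the retriever returns a typed payload and the reasoner only ever sees that payload. The names (`ContextPayload`, `run_agent`, `on_seam`), the in-memory corpus, and both stub components are hypothetical stand-ins; a real system would put a vector store and a model call behind the same two function types.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ContextPayload:
    """Everything that crosses the seam from retrieval to reasoning."""
    query: str
    chunk_ids: tuple[str, ...]
    chunks: tuple[str, ...]


Retriever = Callable[[str], ContextPayload]
Reasoner = Callable[[ContextPayload], str]

# Toy corpus for illustration only.
CORPUS = {
    "doc-1": "Pallet lead time is 14 days.",
    "doc-2": "Returns are accepted within 30 days.",
}


def keyword_retriever(query: str) -> ContextPayload:
    """Stand-in retriever: keep documents sharing at least one word with the query."""
    terms = set(query.lower().split())
    hits = [
        (doc_id, text)
        for doc_id, text in CORPUS.items()
        if terms & set(text.lower().rstrip(".").split())
    ]
    return ContextPayload(query, tuple(d for d, _ in hits), tuple(t for _, t in hits))


def stub_reasoner(payload: ContextPayload) -> str:
    """Stand-in for the model call: answers only from the supplied chunks."""
    if not payload.chunks:
        return "I do not have enough context to answer."
    return f"Based on {payload.chunk_ids[0]}: {payload.chunks[0]}"


def run_agent(
    query: str,
    retrieve: Retriever,
    reason: Reasoner,
    on_seam: Optional[Callable[[ContextPayload], None]] = None,
) -> str:
    payload = retrieve(query)
    if on_seam is not None:
        on_seam(payload)  # evaluation hook: log or score the payload here
    return reason(payload)


seen: list[ContextPayload] = []
answer = run_agent("pallet lead time", keyword_retriever, stub_reasoner, seen.append)
```

Because the payload is captured before the reasoner runs, retrieval metrics can be computed from `seen` alone, and the same payload can be replayed against a different reasoner to isolate a model change.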
The feedback loop
Context-evaluation scores feed two things:
- Dataset growth. Cases where retrieval missed, where the window was poorly composed, or where the model ignored supplied context become regression cases in the ground-truth dataset. Each carries the expected context, not just the expected answer.
- Per-dimension gates. CI/CD runs the three dimensions on every prompt, retrieval-config, or model change. A drop on retrieval quality blocks the merge even if end-to-end task success looks stable.
The loop is what keeps the system honest as the corpus, the prompts, and the model change.
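A per-dimension gate can be as small as a comparison between a stored baseline and the candidate run. The sketch below is one possible shape; the `gate` function, the dimension names, and the 0.02 tolerance are assumptions to be tuned per team, and the scores are illustrative.

```python
def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,
) -> list[str]:
    """Return one message per dimension that regressed by more than max_drop.

    An empty list means the change may merge. Every dimension is checked on
    its own, so a stable end-to-end score cannot mask a retrieval regression.
    """
    failures = []
    for dimension, base in baseline.items():
        cand = candidate.get(dimension)
        if cand is None:
            failures.append(f"{dimension}: missing from candidate run")
        elif base - cand > max_drop:
            failures.append(f"{dimension}: {base:.2f} -> {cand:.2f}")
    return failures


baseline = {"retrieval_recall": 0.84, "faithfulness": 0.79, "task_success": 0.71}
candidate = {"retrieval_recall": 0.78, "faithfulness": 0.80, "task_success": 0.71}
failures = gate(baseline, candidate)
```

Here `failures` contains a single entry for `retrieval_recall` even though `task_success` did not move. In a CI job, a non-empty list would translate into a non-zero exit code.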
Example
A team operates a supply-chain decision agent across manufacturing, logistics, and retail. The baseline:
- End-to-end task success: 0.71.
- Two model upgrades in the past quarter; task success did not move.
- User complaints cluster around "the agent answered something related but not what I asked".
The team stands up context evaluation:
- Week 1. Retrieval quality scored on a 200-case dataset. Recall 0.84, precision 0.62. The retrieval is over-pulling near-duplicates.
- Week 2. Window composition scored. Median window saturation 0.78 of the effective attention limit. Relevant chunks placed in the middle of the window more than half the time.
- Week 3. Context utilization scored. Faithfulness 0.79; coverage of supplied context 0.61; refusal under insufficient context 0.38 (the model fabricated rather than declining in 62 percent of the insufficient-context cases).
- Week 5. Retrieval re-ranker added; deduplication step added; window composition reordered to place the highest-ranked chunk last. Refusal prompt strengthened.
- Week 7. Recall 0.91, precision 0.81. Faithfulness 0.93. End-to-end task success 0.83.
The 12-point gain came from the context layer. The team did not change the model.
Limitations
Caveats worth flagging up front:
- Ground-truth datasets must include expected context. Without it, retrieval quality cannot be scored; the dataset is more expensive to build.
- Calibration per dimension is required. A faithfulness judge that disagrees with humans is producing noise. Track per-dimension agreement weekly.
- Window composition scoring is model-specific. Effective attention behavior varies by model family; the composition that works for one model may not for another.
- Context evaluation is not a substitute for end-to-end evaluation. It is a complement. Task success, plan coherence, and policy adherence still need their own scoring.
- The architecture has to support the seam. Monolithic agents need refactoring before context evaluation can be wired in cleanly.
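For the calibration caveat, one common way to track judge-versus-human agreement is Cohen's kappa, which corrects raw agreement for chance. The article does not prescribe a specific agreement statistic, so kappa is a choice made for this sketch, and the labels below are illustrative.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences over the same cases."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human_labels = ["faithful", "faithful", "unfaithful", "unfaithful"]
judge_labels = ["faithful", "unfaithful", "unfaithful", "unfaithful"]
kappa = cohen_kappa(human_labels, judge_labels)
```

On these four cases raw agreement is 0.75 but kappa is 0.5, which is why raw agreement alone can overstate how well a judge is calibrated.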
Evidence and sources
- Reducing hallucinations in production, https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/reduce-hallucinations, for the faithfulness and refusal discipline behind context-utilization scoring.
- Retrieval guidance, https://platform.openai.com/docs/guides/retrieval, for the retrieve-then-reason architecture that the seam evaluation depends on.
- Evaluating AI Agents short course, https://www.deeplearning.ai/short-courses/evaluating-ai-agents/, for the context-as-first-class-dimension framing.
FAQ
Why not just measure end-to-end task success? End-to-end metrics tell you something is wrong; they do not tell you where. A drop in task success could be retrieval, window composition, utilization, the model, or the prompt. Per-dimension context evaluation pinpoints the layer.
Should retrieval be evaluated by an LLM judge or by classical IR metrics? Both. Classical metrics (recall, precision, MRR) score the retrieval system. An LLM judge can score the relevance of each retrieved chunk to the query when ground-truth relevance labels are sparse. Use the classical metric where you can; use the judge where the labels are not feasible.
How do I know if my context window is saturated? Two signals: the proportion of the model's effective attention budget the prompt is consuming, and the degradation of the model's performance as more context is added. A small adversarial test (same query, more padding) surfaces the saturation point empirically.
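The padding test described in that answer can be sketched as a sweep. `run_case` is a hypothetical callable supplied by the caller: it runs the same query set with the given number of filler tokens added to the window and returns a score from 0 to 1. The 0.05 tolerance is likewise an assumption.

```python
from typing import Callable, Iterable, Optional


def find_saturation_point(
    run_case: Callable[[int], float],
    padding_levels: Iterable[int],
    tolerance: float = 0.05,
) -> Optional[int]:
    """Return the smallest padding level whose score falls more than
    `tolerance` below the unpadded score, or None if no level degrades."""
    unpadded = run_case(0)
    for level in sorted(padding_levels):
        if unpadded - run_case(level) > tolerance:
            return level
    return None


def fake_run(padding_tokens: int) -> float:
    # Simulated agent for illustration: holds steady, then degrades at 8000 tokens.
    return 0.9 if padding_tokens < 8000 else 0.7


point = find_saturation_point(fake_run, [2000, 4000, 8000, 16000])
```

With the simulated agent, `point` is 8000. Against a real agent the scores are noisy, so each level is worth running several times before trusting the threshold.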
What does "refusal under insufficient context" actually measure? Of the cases where the supplied context does not support a correct answer, the fraction in which the model declines or signals uncertainty rather than fabricating. A score below 0.5 is a strong signal that the model is filling gaps from training data.
How does this relate to RAG evaluation? Context evaluation is the broader practice; RAG evaluation is the special case where the context comes from a retrieval-augmented pipeline. Tool responses, conversation memory, and system instructions are all context too. The dimensions are the same.