Evals are your competitive edge. If you can measure whether your AI feature is getting better or worse, you can iterate with confidence. Swapping models, tweaking prompts, and changing retrieval logic are scary things to do, but a bit less so if you can measure the impact. Without some kind of evaluation layer, every change is a leap of faith.
The corollary: if evals are that valuable, they probably shouldn't live inside a vendor's dashboard that you can't control, export, or extend. They should be yours.
But how hard is it actually to build them yourself? That question is worth stress-testing with a real example rather than theorizing.
The Case Study
The example app is a simple RAG agent that answers questions about the Scorable documentation. It:
- Takes a user question
- Creates an embedding and retrieves the top 8 matching doc sections from pgvector
- Passes those sections to an OpenAI model to generate an answer
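That three-step flow can be sketched as a plain pipeline. This is a hedged illustration, not the app's actual code: the names are invented, and the real embedding, pgvector, and OpenAI calls are injected as callables so only the shape is shown.

```python
from dataclasses import dataclass
from typing import Callable

TOP_K = 8  # number of doc sections retrieved per question

# The kind of pgvector query the agent might run; <=> is pgvector's
# cosine-distance operator. Illustrative, not the app's actual SQL.
SEARCH_SQL = """
    SELECT title, content
    FROM doc_sections
    ORDER BY embedding <=> %(query_embedding)s
    LIMIT %(k)s
"""

@dataclass
class DocSection:
    title: str
    content: str

def answer_question(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[DocSection]],
    generate: Callable[[str, list[DocSection]], str],
) -> str:
    """Embed the question, fetch the top-k sections, generate an answer."""
    query_embedding = embed(question)
    sections = search(query_embedding, TOP_K)
    return generate(question, sections)
```

Keeping the external calls injectable also makes the pipeline trivially testable with stubs, which matters once the eval layer arrives.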
Granted, it's yet another toy example. But it's exactly the kind of thing that ends up in production and then gets "improved" until something quietly breaks.
What We'll Do
Two passes at adding evals to this app:
Pass 1: Vibe coded. Use Claude to write LLM-as-judge scorers from scratch. No eval framework, just prompts and Python. See how far we get and what the results look like.
Pass 2: Managed solution. Add Scorable with pre-built evaluators. Same questions, same agent, different infrastructure.
Then we'll compare:
- How much effort each approach took to set up
- Whether the scores and insights differ meaningfully
- What you'd have to build on top of the DIY version to get the same value long-term (score history, regression detection, dashboards...)
The hypothesis going in: the vibe-coded scorers will work (they're just LLM calls after all), but "works" and "provides durable business value" are different things.
Interlude: The Starting App
The RAG agent is a standalone Python project. It seeds a database with the Scorable documentation by creating embeddings and inserting them into postgres with the pgvector extension.
Users can ask the agent questions about the Scorable documentation. The agent has a minimalistic system prompt defining the expected knowledge level of the user.
Chapter 2: Pass 1, Vibe Coding the Evals
What We Built
A pytest file with three LLM-as-judge scorers, each returning a structured result.
Three metrics, chosen because they cover the two failure modes a RAG system has: generating a bad answer and retrieving the wrong context:
| Metric | What it catches |
|---|---|
| Faithfulness | Answer invents facts not present in the retrieved context |
| Answer relevance | Answer is correct but off-topic, or ignores the question |
| Tool call quality | Retrieval wasn't triggered, or the search query was poor |
Each scorer is a pydantic-ai agent with a structured output model and a system prompt describing the rubric:
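The actual file uses pydantic-ai agents with Pydantic output models; below is a dependency-free sketch of the same shape (a rubric prompt, a structured result, a parser), with every name illustrative.

```python
import json
from dataclasses import dataclass

# Illustrative rubric; the real scorer's prompt differs.
FAITHFULNESS_RUBRIC = """\
You are grading a RAG answer for faithfulness.
Given the retrieved context and the answer, return JSON with:
  "score": a float in [0, 1], where 1 means every claim in the
           answer is supported by the context,
  "justification": one sentence explaining the score.
Return only the JSON object."""

@dataclass
class JudgeResult:
    score: float
    justification: str

def parse_judge_output(raw: str) -> JudgeResult:
    """Parse the judge LLM's JSON reply into a structured result."""
    data = json.loads(raw)
    return JudgeResult(
        score=float(data["score"]),
        justification=str(data["justification"]),
    )
```

With pydantic-ai, the structured-output parsing is handled for you; this sketch just makes that step visible.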
Running the agent on a question and passing the result to a scorer gets back a score and a one-sentence justification. Extracting the retrieved context requires walking the agent's message history and pulling content out of the right message parts, workable but not obvious.
How It Runs
9 test cases (3 questions × 3 metrics). Each asserts that the score clears a threshold.
. The justification is included in the assertion message so failures are self-explaining.
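The assertion pattern might look like this; the threshold and names are illustrative, not the file's actual values.

```python
FAITHFULNESS_THRESHOLD = 0.7  # illustrative, not the file's actual value

def assert_faithful(score: float, justification: str) -> None:
    """The pattern each test uses: the judge's one-sentence
    justification rides along in the failure message."""
    assert score >= FAITHFULNESS_THRESHOLD, (
        f"faithfulness {score:.2f} below {FAITHFULNESS_THRESHOLD}: "
        f"{justification}"
    )
```

A red test then reads like a bug report ("faithfulness 0.45 below 0.7: the answer invents a config flag not present in the context") without re-running the judge.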
What It Took
About 10 minutes of prompting. The scorers are genuinely just LLM calls wrapped in Pydantic models. The hardest part was knowing which message types to look for when extracting context and search queries from the agent's message history.
What's Already Missing
The moment you write these tests and run them once, you hit the limits of the DIY approach:
- No history. You get a pass/fail today. You have no idea if scores improved or regressed since last week.
- No aggregates. Average faithfulness across the dataset? Requires more code.
- Flaky by nature. LLM-as-judge scores vary run to run, so a fixed threshold will flap. Managing that requires more code; the fancy term for what we actually want is a confidence interval.
- Cost is invisible. Each test run makes 6 LLM calls per question (1 RAG + 3 scorers, run twice across the 3 test functions that re-run the agent). No tracking.
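One stdlib-only way to tame the flapping, at the cost of N judge calls per case: score repeatedly and gate on the lower bound of a rough interval instead of a single draw. A sketch, with `judge` standing in for any zero-argument callable that returns a score:

```python
import statistics

def score_with_interval(judge, n: int = 5, z: float = 1.96):
    """Call the judge n times; return (mean, lower bound of an
    approximate 95% confidence interval using a normal approximation).
    Asserting on the lower bound instead of a single score makes
    the threshold far less likely to flap."""
    scores = [judge() for _ in range(n)]
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / (n ** 0.5)
    return mean, mean - half_width
```

For small n the normal approximation is crude (a t-distribution would be more honest), but it already turns "one noisy draw" into something a threshold can sensibly gate on.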
None of these is a blocker for getting a score, but they could become blockers later on a production system. That said, vibe-coded evals are a perfectly valid way to get started.
A note on "vibe coded": this example is intentionally naive, the quickest possible thing that produces a number. In practice, no serious team would start here. They'd reach for DSPy, OpenEvals, MLflow's eval suite, Pydantic evals, or similar libraries that handle prompt templating, output parsing, confidence estimation, and run management out of the box. That's a completely valid approach and likely the right one for most teams building their own eval layer. The point of doing it from scratch here is just to make the components visible before comparing them.
Chapter 3: Pass 2, Scorable Built-in Evaluators
What Changed
Same pytest structure, but the scorer agents are gone. Instead of prompting an LLM ourselves, we call Scorable's pre-built context-aware evaluators:
Two evaluators used:
| Evaluator | What it measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? |
| Context Recall | Does the retrieved context contain enough information to produce the correct answer? |
Context Recall requires a ground truth answer, which forced us to actually write down what correct answers look like, something the vibe-coded version skipped entirely.
Two Small Parameters That Change What the Platform Can Do
Two fields are worth calling out explicitly: the agent's system prompt and the run tags.
Passing the agent's actual system prompt to the evaluator closes a gap: an evaluator judging faithfulness without knowing what the agent was trying to do, in our case the policy defining the expected knowledge level of the user, is scoring in a vacuum. It also gives the platform the raw material to suggest prompt improvements based on evaluation results.
Tags are how you tell the platform what kind of run produced a score: an environment (dev, staging, prod), a specific feature flag, a model version. In the vibe-coded version, all scores are just numbers with no provenance. Tags are what let the platform show you "faithfulness in production has been trending down since Monday" rather than just "here is a number."
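What provenance looks like as data, sketched with invented field names (this is not Scorable's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class ScoredRun:
    metric: str
    score: float
    tags: dict[str, str] = field(default_factory=dict)

def trend(runs: list[ScoredRun], metric: str, **tag_filter: str) -> list[float]:
    """Filter scores by metric and tags: the query behind a statement
    like 'faithfulness in production has been trending down'."""
    return [
        r.score for r in runs
        if r.metric == metric
        and all(r.tags.get(k) == v for k, v in tag_filter.items())
    ]
```

The point is not the ten lines of code; it's that every score in the vibe-coded version lacks the tags that make this query possible at all.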
Neither of these can be replicated in the DIY version without building the tagging and filtering infrastructure yourself, which brings us back to the list in the next chapter.
What the Ground Truth Forced
Adding a ground truth answer to each case is a meaningful shift: you're no longer just asking "is this answer okay?" but "does the system retrieve context that can produce this specific correct answer?" The ground truth is now an explicit artifact you own and can improve over time.
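That artifact can be as simple as a small, versioned structure in your repo. The example case below is invented for illustration; the shape is what matters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    """Ground truth as an owned artifact: review it, version it, grow it."""
    question: str
    ground_truth: str

# Hypothetical case; not from the actual test suite.
CASES = [
    EvalCase(
        question="How do I create a custom evaluator?",
        ground_truth="Define the evaluator's intent and calibration "
                     "examples; versioning is built in.",
    ),
]
```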
Custom Evaluators
The built-in evaluators cover common RAG cases, but you can also define domain-specific ones through an opinionated structure of intent and calibration examples, with versioning built in. See the custom evaluator docs.
What's Still Missing (But Less of It)
- No history / regression tracking: still just pass/fail per run
- No cost visibility: Scorable handles the LLM calls for evaluation, but you still don't see them in your own dashboards
The evaluators themselves are solved. The surrounding infrastructure (tracking scores over time, alerting on regressions, sharing results with non-engineers) is what a platform like Scorable is built for.
Chapter 4: The Real Gap Isn't the Evaluators
The Honest Assessment of What We Compared
If you squint at the two files side by side, the delta is mostly API surface. The vibe version writes a system prompt and calls a pydantic-ai agent. The Scorable version calls the platform's evaluators. Both return a score and a justification.
With a few iterations, you could tune the vibe-coded prompts to produce scores that closely match Scorable's. You could add confidence intervals. You could add a rubric that handles edge cases. LLM-as-judge is not a hard problem to approximate; it's a few hundred tokens of prompt engineering. Anyone claiming otherwise is selling something (which I am of course also guilty of doing).
So if the evaluators are equivalent after a bit of polish, why does the build vs. buy question still matter?
What the Score Doesn't Tell You
A score of 0.73 on faithfulness means almost nothing by itself. What you actually want to know:
- Is 0.73 better or worse than last week?
- Which questions score consistently low, and why?
- Did the score drop after you changed the system prompt on Tuesday?
- Is faithfulness lower for users on the free tier vs. paid?
- When a user complains, can you pull up the exact run, its retrieved context, and its scores in one place?
None of these questions are answerable from a pytest output. They require a different layer entirely.
What "Doing It Yourself" Actually Means
Here's what you'd need to build to get from "I have a score" to "I have actionable insight":
- Persistence: write scores to a database, attached to a run ID
- Run metadata: tag each eval with environment (dev/staging/prod), model version, prompt version, user ID, session ID
- Dev vs. prod separation: a score from a pytest run should not pollute your production trend line. This also means instrumenting your production agent, not just your tests
- Visualization: something a non-engineer can open and understand
- Alerting: notifications with enough detail to be actionable; a naive "score below magic number" check is often not enough
- Insights at scale: if you run thousands of evals daily, raw numbers and justifications are noise. You need something that surfaces patterns, not just rows
- Drill-down: click a low score and see the full trace: question, retrieved chunks, answer, scorer reasoning
- Evaluation of your evaluators: are your judges consistent? Do they agree with human ratings?
- Prompt optimization: once you have scores, there are algorithms (GEPA, DSPy optimizers) that can automatically find better prompts by running your evaluator in a loop. Do you want to learn how those work and wire them up yourself, or just press a button?
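To make the gap concrete, here is a minimal SQLite sketch of just the first two bullets, persistence and run metadata; the schema and names are illustrative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_scores (
    run_id      TEXT NOT NULL,
    metric      TEXT NOT NULL,
    score       REAL NOT NULL,
    environment TEXT NOT NULL,  -- dev / staging / prod
    recorded_at TEXT NOT NULL DEFAULT (datetime('now'))
)
"""

def record(conn: sqlite3.Connection, run_id: str, metric: str,
           score: float, environment: str) -> None:
    """Persist one score with enough metadata to filter on later."""
    conn.execute(
        "INSERT INTO eval_scores (run_id, metric, score, environment) "
        "VALUES (?, ?, ?, ?)",
        (run_id, metric, score, environment),
    )

def weekly_average(conn: sqlite3.Connection, metric: str,
                   environment: str) -> float:
    """The 'is 0.73 better than last week' query; and this is still
    only two of the nine bullets above."""
    row = conn.execute(
        "SELECT AVG(score) FROM eval_scores "
        "WHERE metric = ? AND environment = ? "
        "AND recorded_at >= datetime('now', '-7 days')",
        (metric, environment),
    ).fetchone()
    return row[0] if row[0] is not None else float("nan")
```

Thirty lines gets you a trend number; it gets you none of the visualization, alerting, drill-down, or judge-calibration items that follow it on the list.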
Every one of these is a solvable engineering problem. None of them is particularly hard. Together they are several weeks of work that has nothing to do with the product you're actually building.
The Real Cost Is Focus
Every hour spent having Claude build an eval dashboard is an hour not spent on the feature that makes users stay. While implementing new code is trivial, managing a swarm of coding agents is an attention sink.
This is the same trade-off as any dev tooling decision. You can build your own CI pipeline. You can write your own error tracker. You can instrument your own distributed traces. The question was never "is it technically possible?" Of course it is. The question is whether the thing you're building is your competitive advantage, or just infrastructure that enables your competitive advantage.
For most teams, the eval scorer is not the moat. It is just a technical implementation detail of how you build your AI feature robustly. The time should be spent on understanding your users, iterating on your product, and shipping fast; that's the moat. The evals are just the instrument you use to make sure you're not breaking things while you move.
The Verdict
Build the evaluators if you want to understand how they work; it's a useful exercise and the code is simple. But when you find yourself writing dashboards, alerting, batch analysis, and GEPA loops, ask whether that's where your focus should be.
