# Add Scorable LLM-as-a-Judge to Your Application

These instructions guide you through creating LLM evaluation judges with Scorable and integrating them into your codebase. Scorable is a tool for creating LLM-as-a-Judge evaluators for safeguarding applications. A *judge* is the Scorable term for a group of evaluations from different metrics (Helpfulness, Policy Adherence, etc.).

## Execution Contract

You are responsible for completing Scorable setup and integration end-to-end in as few turns as possible.

- You MUST analyze the codebase for LLM interaction points.
- You MUST install and use the Scorable CLI directly.
- You MUST execute judge generation commands yourself.
- You MUST integrate judge execution into code yourself.
- You MUST run verification checks after changes.
- You MUST update project documentation for usage.
- You MUST NOT delegate technical steps to the user, except where explicitly required (e.g. setting up the API key when not using a temporary key).
- You MUST continue until implementation is complete or a hard blocker is reached.

## Overview

Your role is to:

1. **Analyze the codebase** to identify LLM interactions
2. **Create judges via the Scorable CLI** to evaluate those interactions (or use an existing judge ID if provided)
3. **Integrate judge execution** into the code at appropriate points
4. **Provide usage documentation** for the evaluation setup

**Note:** These instructions work both for creating new judges from scratch and for integrating existing judges. If the user provides a judge ID, you can skip the judge creation step (Step 3) and proceed directly to integration (Step 4).

---

## Step 0: Explain the Process

Before performing any analysis or technical steps, pause and clearly brief the user on what is about to happen.
Explain that you will:

- Analyze the codebase to identify LLM interactions
- Create judges via the Scorable CLI to evaluate those interactions
- Integrate judge execution into the code at appropriate points
- Provide usage documentation for the evaluation setup

---

## Step 1: Analyze the Application

Examine the codebase to understand:

- What LLM interactions exist (prompts, completions, agent calls)
- What the application does at each interaction point
- Which interactions are most critical to evaluate

If multiple LLM interactions exist, help the user prioritize. Recommend starting with the most critical one first.

---

## Step 2: Install the Scorable CLI

Install the CLI so the user can generate and manage judges from the terminal.

### Installation

```bash
curl -sSL https://scorable.ai/cli/install.sh | sh
```

Or with npm directly:

```bash
npm install -g @root-signals/scorable-cli
```

Or run without installing via npx:

```bash
npx @root-signals/scorable-cli judge list
```

### Authentication

Get a free demo key (no registration required):

```bash
scorable auth demo-key
```

Or set a permanent key from [scorable.ai/api-key-setup](https://scorable.ai/api-key-setup):

```bash
scorable auth set-key
# then paste the key when prompted
```

Or use an environment variable:

```bash
export SCORABLE_API_KEY="sk-your-api-key"
```

**Security:** Instruct the user to use environment variables or the project's secret management for the API key. Do not ask the user to paste the key into this session; instead, instruct them to run the `scorable auth set-key` command themselves.
If a temporary demo key was used, warn the user that:

- Judges created with it will be public and visible to everyone
- The key only works for a limited time
- For private judges, they should create a permanent key at https://scorable.ai/register

---

## Step 3: Generate a Judge

**Note:** If the user has already provided a judge ID (e.g. in their message), you can skip this step and proceed directly to Step 4 (Integration).

### Intent String Guidelines

- Describe the application context and what you're evaluating
- Mention the specific execution point (stage name)
- Include the critical quality dimensions you care about
- Add examples, documentation links, or policies if relevant
- Be specific and detailed (multiple sentences or paragraphs are good)
- Code-level details such as frameworks and libraries do not need to be mentioned

### Using the Scorable CLI

You should run these commands yourself: once the user has authenticated, take control back and execute them directly.

```bash
scorable judge generate --intent "An email automation system that creates summary emails using an LLM based on database query results and user input. Evaluate the LLM's output for: accuracy in summarizing data, appropriate tone for the audience, inclusion of all key information from queries, proper formatting, and absence of hallucinations. The system is used for customer-facing communications." --visibility private --reasoning-effort medium
```

Use `--visibility public` if using a temporary API key.

**Optional fields:**

- `enable_context_aware_evaluators`: Set to `true` if the application uses RAG (document chunks) — enables hallucination detection, context drift, etc.

### Handling Judge Generation Responses

Judge generation may return:

**1. `missing_context_from_system_goal`** — additional context is needed:

→ Ask the user for these details (if not evident from the codebase), then re-run with the additional context. CLI:

```bash
scorable judge generate --intent "..." --judge-id --extra-contexts '{"target_audience":"Enterprise customers"}'
```

**2. `multiple_stages`** — the judge detected multiple evaluation points:

```json
{
  "error_code": "multiple_stages",
  "stages": ["Stage 1", "Stage 2", "Stage 3"]
}
```

→ Ask the user which stage to focus on, or whether they have a custom stage name. Each judge evaluates one stage; you can create additional judges later for other stages. Re-run with `--stage ""` (CLI).

**3. Success** — judge created:

```json
{
  "judge_id": "abc123...",
  "evaluator_details": [...]
}
```

→ Proceed to integration.

---

## Step 4: Integrate Judge Execution

Add code to evaluate LLM outputs at the appropriate execution point(s).

If the codebase uses a framework, check whether the Scorable docs have integration instructions for it (fetching with curl is enough): https://docs.scorable.ai/llms.txt

### Python Example

```python
from scorable import Scorable

# Synchronous
client = Scorable(api_key="your-api-key")
result = client.judges.run(
    judge_id="judge-id-here",
    request="INPUT to the LLM (optional)",
    response="OUTPUT from the LLM (required)",
)

# Async
client = Scorable(api_key="your-api-key", run_async=True)
result = await client.judges.arun(
    judge_id="judge-id-here",
    request="INPUT to the LLM",
    response="OUTPUT from the LLM",
)

# Results are pydantic models
print(result.evaluator_results[0].score)          # float between 0 and 1
print(result.evaluator_results[0].justification)  # string
```

### TypeScript/JavaScript Example

```typescript
import { Scorable } from "@root-signals/scorable";

const client = new Scorable({ apiKey: "your-api-key" });
const result = await client.judges.execute("judge-id-here", {
  request: "What is the refund policy?",
  response: "You can return items within 30 days.",
});
// result.evaluator_results[0].score
```

### Other Languages (cURL as a Template)

```bash
curl 'https://api.scorable.ai/v1/judges/{judge_id}/execute/' \
  -H 'authorization: Api-Key your-api-key' \
  -H 'content-type: application/json' \
  --data-raw '{"response":"LLM output here","request":"User input here"}'
```

### RAG

**If you identify that the application uses RAG (Retrieval-Augmented Generation)**, you MUST include the `contexts` parameter:

```python
eval_result = client.judges.run(
    judge_id="judge-id",
    request="User question",
    response="LLM response",
    contexts=["retrieved doc 1", "retrieved doc 2", ...],  # REQUIRED for RAG
)
```

The `contexts` parameter is available in all SDKs and in the API.

### Optional Parameters for the Execute Call

Available in the SDKs and the API. Use them ONLY if relevant to the evaluation:

- `contexts`: For RAG setups, a list of context snippets to evaluate the response against
- `user_id`: The ID of the user interacting with the application
- `tags`: Tags for easier filtering and analysis, e.g. `["production", "v1.0"]`
- `expected_output`: The expected output for the response

### Integration Points

- Insert evaluation code where LLM outputs are generated (for example, after an OpenAI responses call)
- `response` parameter: the text you want to evaluate (required)
- `request` parameter: the input that prompted the response (optional but recommended)
- Use actual variables from your code, not static strings

### Result Format

```json
{
  "evaluator_results": [
    {
      "evaluator_name": "Accuracy",
      "score": 0.85,
      "justification": "The response correctly identifies..."
    }
  ]
}
```

Each evaluator returns a score (0–1, higher is better) and a natural-language justification.

---

## Step 5: Provide Next Steps

After integration:

1. **Ask about additional judges**: If multiple stages were identified, ask whether the user wants to create judges for the other stages
2. **Discuss evaluation strategy**:
   - Should every LLM call be evaluated, or a sample (e.g. 10%)?
   - Should scores be stored in a database for analysis?
   - Should specific actions trigger based on scores (e.g. alerts for low scores)?
   - Batch evaluation vs. real-time evaluation?
3. **Provide judge details**:
   - Judge URL: `https://scorable.ai/judge/{judge_id}`
   - If you used a temporary key, include the `api_token` base64-encoded as a query parameter: `https://scorable.ai/judge/{judge_id}?token={base64-encoded temporary api_token}`
   - How to view results in the Scorable overview (https://scorable.ai/overview)
   - If a temporary key was used, note that it only works for a limited time and that the user should create an account with a permanent key
4. **CLI usage**: Tell them they can inspect, run, fetch execution logs for, and manage judges and evaluators using the Scorable CLI
5. **Link to docs**: https://docs.scorable.ai
   - For agentic workflows with tool calls or multi-turn conversations, link to https://docs.scorable.ai/usage/usage/judges#multi-turn-conversations

---

## Key Implementation Notes

- **Install the SDK first**: Check which dependency management system is used and install the appropriate package
- **Store API keys securely**: Use environment variables, not hardcoded strings
- **Handle errors gracefully**: Evaluation failures shouldn't break your application
- **Start simple**: Evaluate one stage first, then expand
- **Sampling for production**: 5–10% sampling reduces costs while maintaining visibility
- **Non-blocking**: Evaluation should not block the main thread or slow down the application

---

## Common Patterns

### Pattern 1: Development (100% evaluation)

```python
response = llm.generate(prompt)
eval_result = client.judges.run(judge_id, request=prompt, response=response)
log_evaluation(eval_result)
```

### Pattern 2: Production with Sampling (10% evaluation)

```python
response = llm.generate(prompt)
if random.random() < 0.1:  # 10% sampling
    eval_result = client.judges.run(judge_id, request=prompt, response=response)
    store_evaluation_in_db(eval_result)
```

### Pattern 3: Batch Evaluation

See https://docs.scorable.ai/usage/cookbooks/batch-evaluation
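The sampling, error-handling, and non-blocking notes above can be combined into a single helper. This is a minimal sketch, not part of the Scorable SDK: the `evaluate_sampled_async` name and its `judge_run` parameter are illustrative — in practice you would pass `client.judges.run` as `judge_run`.

```python
import random
import threading


def evaluate_sampled_async(judge_run, judge_id, request, response, sample_rate=0.1):
    """Sampled, fire-and-forget judge evaluation.

    judge_run: a callable such as client.judges.run from the Scorable SDK.
    Returns the background thread if an evaluation was started, else None.
    """
    if random.random() >= sample_rate:
        return None  # skipped by sampling

    def _worker():
        try:
            judge_run(judge_id=judge_id, request=request, response=response)
        except Exception as exc:
            # Evaluation failures must never break the main application path
            print(f"judge evaluation failed: {exc}")

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()  # the main thread continues immediately
    return thread
```

With `sample_rate=1.0` every call is evaluated (development); `0.1` matches the 10% production sampling pattern above.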