# Add Scorable LLM-as-a-Judge to Your Application

These instructions guide you through creating LLM evaluation judges with Scorable and integrating them into your codebase. Scorable is a tool for creating LLM-as-a-Judge evaluators that safeguard applications. A *judge* is Scorable's term for a group of evaluators covering different metrics (Helpfulness, Policy Adherence, etc.).

## Overview

Your role is to:

1. **Analyze the codebase** to identify LLM interactions
2. **Create judges via the Scorable API** to evaluate those interactions
3. **Integrate judge execution** into the code at appropriate points
4. **Provide usage documentation** for the evaluation setup

---

## Step 1: Analyze the Application

Examine the codebase to understand:

- What LLM interactions exist (prompts, completions, agent calls)
- What the application does at each interaction point
- Which interactions are most critical to evaluate

If multiple LLM interactions exist, help the user prioritize. Recommend starting with the most critical one first.

---

## Step 2: Get a Scorable API Key

Ask the user whether they want to:

- **Create a temporary API key** (for quick testing). Warn the user that the judge will be public and visible to everyone.
- **Use an existing API key** (for production)

### To create a temporary API key:

```bash
curl --request POST \
  --url https://api.scorable.ai/create-demo-user/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json'
```

The response includes an `api_key` field. Warn the user appropriately, and remind them that the judge will be public and visible to everyone. If they want to keep it private, tell them to create a permanent API key.

If the user wants a permanent key and doesn't have an account, direct them to https://scorable.ai/register.

---

## Step 3: Generate a Judge

Call the `/v1/judges/generate/` endpoint with a detailed `intent` string.

### Intent String Guidelines:

- Describe the application context and what you're evaluating
- Mention the specific execution point (stage name)
- Include the critical quality dimensions you care about
- Add examples, documentation links, or policies if relevant
- Be specific and detailed (multiple sentences or paragraphs are good)
- Code-level details such as frameworks and libraries do not need to be mentioned

**Example with all required fields filled** (set `"visibility"` to `"public"` if using a temporary API key):

```bash
curl --request POST \
  --url https://api.scorable.ai/v1/judges/generate/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --header 'Authorization: Api-Key YOUR_API_KEY' \
  --data '{
    "visibility": "unlisted",
    "intent": "An email automation system that creates summary emails using an LLM based on database query results and user input. Evaluate the LLM'\''s output for: accuracy in summarizing data, appropriate tone for the audience, inclusion of all key information from queries, proper formatting, and absence of hallucinations. The system is used for customer-facing communications.",
    "generating_model_params": {
      "temperature": 0.2,
      "reasoning_effort": "medium"
    }
  }'
```

Note that this call can take up to 2 minutes to complete.
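If you prefer to issue the same request from code rather than curl, the snippet below is a minimal Python sketch of that call. It assumes the `requests` package and an API key exposed in a `SCORABLE_API_KEY` environment variable (both assumptions for illustration, not requirements of the API):

```python
import os

import requests

# Minimal sketch: generate a judge via the HTTP API using the same payload
# as the curl example above. The API key is read from an environment variable.
api_key = os.environ["SCORABLE_API_KEY"]

payload = {
    "visibility": "unlisted",  # use "public" with a temporary API key
    "intent": (
        "An email automation system that creates summary emails using an LLM "
        "based on database query results and user input. Evaluate the LLM's "
        "output for accuracy, tone, completeness, formatting, and absence of "
        "hallucinations."
    ),
    "generating_model_params": {"temperature": 0.2, "reasoning_effort": "medium"},
}

resp = requests.post(
    "https://api.scorable.ai/v1/judges/generate/",
    headers={"Authorization": f"Api-Key {api_key}", "content-type": "application/json"},
    json=payload,
    timeout=180,  # generation can take up to 2 minutes
)
print(resp.status_code)
print(resp.json())  # may contain judge_id or one of the response cases described below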
### Handling API Responses:

The API may return:

**1. `missing_context_from_system_goal`** - Additional context is needed:

```json
{
  "missing_context_from_system_goal": [
    {
      "form_field_name": "target_audience",
      "form_field_description": "The intended audience for the content"
    }
  ]
}
```

→ Ask the user for these details (if they are not evident from the codebase), then call `/v1/judges/generate/` again with:

```json
{
  "judge_id": "existing-judge-id",
  "stage": "Stage name",
  "extra_contexts": {
    "target_audience": "Enterprise customers"
  },
  ...other fields...
}
```

**2. `multiple_stages`** - The judge detected multiple evaluation points:

```json
{
  "error_code": "multiple_stages",
  "stages": ["Stage 1", "Stage 2", "Stage 3"]
}
```

→ Ask the user which stage to focus on, or whether they have a custom stage name. Each judge evaluates one stage. You can create additional judges later for other stages.

**3. Success** - Judge created:

```json
{
  "judge_id": "abc123...",
  "evaluator_details": [...]
}
```

→ Proceed to integration.

---

## Step 4: Integrate Judge Execution

Add code to evaluate LLM outputs at the appropriate execution point(s). If the codebase uses a framework, check whether the Scorable docs have integration instructions for it (using curl is enough): https://docs.scorable.ai/llms.txt

### Python Example:

```python
from scorable import Scorable

# Synchronous
client = Scorable(api_key="your-api-key")
result = client.judges.run(
    judge_id="judge-id-here",
    request="INPUT to the LLM (optional)",
    response="OUTPUT from the LLM (required)"
)

# Async
client = Scorable(api_key="your-api-key", run_async=True)
result = await client.judges.arun(
    judge_id="judge-id-here",
    request="INPUT to the LLM",
    response="OUTPUT from the LLM"
)

# Results are pydantic models
print(result.evaluator_results[0].score)          # float between 0 and 1
print(result.evaluator_results[0].justification)  # string
```

### TypeScript/JavaScript Example:

```typescript
import { Scorable } from "@root-signals/scorable";

const client = new Scorable({ apiKey: "your-api-key" });
const result = await client.judges.execute("judge-id-here", {
  request: "What is the refund policy?",
  response: "You can return items within 30 days.",
});

// result.evaluator_results[0].score
```

### Other Languages (cURL as template):

```bash
curl 'https://api.scorable.ai/v1/judges/{judge_id}/execute/' \
  -H 'authorization: Api-Key your-api-key' \
  -H 'content-type: application/json' \
  --data-raw '{"response":"LLM output here","request":"User input here"}'
```

### Integration Points:

- Insert evaluation code where LLM outputs are generated (for example, after an OpenAI Responses API call); see the sketch at the end of this step
- `response` parameter: the text you want to evaluate (required)
- `request` parameter: the input that prompted the response (optional but recommended)
- Use actual variables from your code, not static strings

### Result Format:

```json
{
  "evaluator_results": [
    {
      "evaluator_name": "Accuracy",
      "score": 0.85,
      "justification": "The response correctly identifies..."
    }
  ]
}
```

Each evaluator returns a score (0-1, higher is better) and a natural-language justification.
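As a concrete illustration, the sketch below wires evaluation into an OpenAI-based generation path and wraps the judge call so evaluation failures never break the main flow. The OpenAI model name, the `generate_and_evaluate` function, and the logging choices are assumptions for this example, not part of the Scorable API:

```python
import logging

from openai import OpenAI
from scorable import Scorable

# Sketch assumptions: the OpenAI Responses API as the generation step and a
# judge_id obtained in Step 3. Swap in your own LLM call and variables.
llm = OpenAI()
scorable_client = Scorable(api_key="your-api-key")  # prefer an environment variable in real code


def generate_and_evaluate(user_input: str, judge_id: str) -> str:
    completion = llm.responses.create(model="gpt-4o-mini", input=user_input)
    output_text = completion.output_text

    # Evaluate the output, but never let an evaluation failure break the application.
    try:
        evaluation = scorable_client.judges.run(
            judge_id=judge_id,
            request=user_input,
            response=output_text,
        )
        for item in evaluation.evaluator_results:
            logging.info("judge score=%.2f justification=%s", item.score, item.justification)
    except Exception:
        logging.exception("Scorable evaluation failed; continuing without scores")

    return output_text
```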
---

## Step 5: Provide Next Steps

After integration:

1. **Ask about additional judges**: If multiple stages were identified, ask whether the user wants to create judges for the other stages
2. **Discuss evaluation strategy**:
   - Should every LLM call be evaluated, or only a sample (e.g., 10%)?
   - Should scores be stored in a database for analysis?
   - Should specific actions trigger based on scores (e.g., alerts for low scores)?
   - Batch evaluation or real-time evaluation?
3. **Provide judge details**:
   - Judge URL: `https://scorable.ai/judge/{judge_id}`
   - If a temporary key was used, it must be included base64-encoded in the URL as a query parameter: `https://scorable.ai/judge/{judge_id}?token={base64-encoded temporary key}`
   - How to view results in the Scorable dashboard (https://scorable.ai/dashboard)
   - If a temporary key was used, note that it only works for a limited time and the user should create an account with a permanent key
4. **Link to the docs**: https://docs.scorable.ai

---

## Key Implementation Notes

- **Install the SDK first**: Check which dependency management system is used and install the appropriate package
- **Store API keys securely**: Use environment variables, not hardcoded strings
- **Handle errors gracefully**: Evaluation failures shouldn't break your application
- **Start simple**: Evaluate one stage first, then expand
- **Sample in production**: 5-10% sampling reduces costs while maintaining visibility

---

## Common Patterns

### Pattern 1: Development (100% evaluation)

```python
response = llm.generate(prompt)
eval_result = client.judges.run(judge_id, request=prompt, response=response)
log_evaluation(eval_result)
```

### Pattern 2: Production with Sampling (10% evaluation)

```python
response = llm.generate(prompt)
if random.random() < 0.1:  # 10% sampling
    eval_result = client.judges.run(judge_id, request=prompt, response=response)
    store_evaluation_in_db(eval_result)
```

### Pattern 3: Batch Evaluation

See https://docs.scorable.ai/usage/cookbooks/batch-evaluation
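The cookbook above covers batch evaluation in detail. As a rough sketch only, assuming the synchronous Python client from Step 4 and a hypothetical list of logged request/response pairs, the core loop looks like this:

```python
from scorable import Scorable

client = Scorable(api_key="your-api-key")

# Hypothetical batch of logged LLM interactions collected by your application.
logged_interactions = [
    {"request": "What is the refund policy?", "response": "You can return items within 30 days."},
    {"request": "Summarize this week's orders.", "response": "There were 42 orders totaling $3,100."},
]

results = []
for item in logged_interactions:
    result = client.judges.run(
        judge_id="judge-id-here",
        request=item["request"],
        response=item["response"],
    )
    results.append({
        "request": item["request"],
        "scores": [e.score for e in result.evaluator_results],
    })

print(results)
```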