# Add Scorable LLM-as-a-Judge to Your Application

These instructions guide you through creating LLM evaluation judges with Scorable and integrating them into your codebase. Scorable is a tool for creating LLM-as-a-Judge evaluators that safeguard applications. A *judge* is Scorable's term for a group of evaluators covering different metrics (Helpfulness, Policy Adherence, etc.).

## Overview

Your role is to:

1. **Analyze the codebase** to identify LLM interactions
2. **Create judges via the Scorable API** to evaluate those interactions
3. **Integrate judge execution** into the code at appropriate points
4. **Provide usage documentation** for the evaluation setup

---

## Step 1: Analyze the Application

Examine the codebase to understand:

- What LLM interactions exist (prompts, completions, agent calls)
- What the application does at each interaction point
- Which interactions are most critical to evaluate

If multiple LLM interactions exist, help the user prioritize. Recommend starting with the most critical one first.

---

## Step 2: Get a Scorable API Key

Ask the user whether they want to:

- **Create a temporary API key** (for quick testing). Warn the user that the judge will be public and visible to everyone.
- **Use an existing API key** (for production)

### To create a temporary API key:

```bash
curl --request POST \
  --url https://api.scorable.ai/create-demo-user/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json'
```

The response includes an `api_key` field. Warn the user appropriately, and remind them that the judge will be public and visible to everyone. If they want to keep it private, tell them to create a permanent API key.

If the user wants a permanent key and doesn't have an account, direct them to https://scorable.ai/register.

---

## Step 3: Generate a Judge

Call the `/v1/judges/generate/` endpoint with a detailed `intent` string.

### Intent String Guidelines:

- Describe the application context and what you're evaluating
- Mention the specific execution point (stage name)
- Include the critical quality dimensions you care about
- Add examples, documentation links, or policies if relevant
- Be specific and detailed (multiple sentences or paragraphs are good)
- Code-level details such as frameworks and libraries do not need to be mentioned

**Example with all required fields filled** (set `"visibility"` to `"public"` if using a temporary API key):

```bash
curl --request POST \
  --url https://api.scorable.ai/v1/judges/generate/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --header 'Authorization: Api-Key YOUR_API_KEY' \
  --data '{
    "visibility": "unlisted",
    "intent": "An email automation system that creates summary emails using an LLM based on database query results and user input. Evaluate the LLM'\''s output for: accuracy in summarizing data, appropriate tone for the audience, inclusion of all key information from queries, proper formatting, and absence of hallucinations. The system is used for customer-facing communications.",
    "generating_model_params": {
      "temperature": 0.2,
      "reasoning_effort": "medium"
    }
  }'
```

Note that this call can take up to 2 minutes to complete.
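If you prefer to issue the same request from code rather than curl, the snippet below is a minimal Python sketch of that call. It assumes the `requests` package and an API key exposed in a `SCORABLE_API_KEY` environment variable (both assumptions for illustration, not requirements of the API):

```python
import os

import requests

# Minimal sketch: generate a judge via the HTTP API using the same payload
# as the curl example above. The API key is read from an environment variable.
api_key = os.environ["SCORABLE_API_KEY"]

payload = {
    "visibility": "unlisted",  # use "public" with a temporary API key
    "intent": (
        "An email automation system that creates summary emails using an LLM "
        "based on database query results and user input. Evaluate the LLM's "
        "output for accuracy, tone, completeness, formatting, and absence of "
        "hallucinations."
    ),
    "generating_model_params": {"temperature": 0.2, "reasoning_effort": "medium"},
}

resp = requests.post(
    "https://api.scorable.ai/v1/judges/generate/",
    headers={"Authorization": f"Api-Key {api_key}", "content-type": "application/json"},
    json=payload,
    timeout=180,  # generation can take up to 2 minutes
)
print(resp.status_code)
print(resp.json())  # may contain judge_id or one of the response cases described below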
### Handling API Responses:

The API may return:

**1. `missing_context_from_system_goal`** - Additional context is needed:

```json
{
  "missing_context_from_system_goal": [
    {
      "form_field_name": "target_audience",
      "form_field_description": "The intended audience for the content"
    }
  ]
}
```

→ Ask the user for these details (if they are not evident from the codebase), then call `/v1/judges/generate/` again with:

```json
{
  "judge_id": "existing-judge-id",
  "stage": "Stage name",
  "extra_contexts": {
    "target_audience": "Enterprise customers"
  },
  ...other fields...
}
```

**2. `multiple_stages`** - The judge detected multiple evaluation points:

```json
{
  "error_code": "multiple_stages",
  "stages": ["Stage 1", "Stage 2", "Stage 3"]
}
```

→ Ask the user which stage to focus on, or whether they have a custom stage name. Each judge evaluates one stage. You can create additional judges later for other stages.

**3. Success** - Judge created:

```json
{
  "judge_id": "abc123...",
  "evaluator_details": [...]
}
```

→ Proceed to integration.

---

## Step 4: Integrate Judge Execution

Add code to evaluate LLM outputs at the appropriate execution point(s). If the codebase uses a framework, check whether the Scorable docs have integration instructions for it (using curl is enough): https://docs.scorable.ai/llms.txt

### Python Example:

```python
from scorable import Scorable

# Synchronous
client = Scorable(api_key="your-api-key")
result = client.judges.run(
    judge_id="judge-id-here",
    request="INPUT to the LLM (optional)",
    response="OUTPUT from the LLM (required)"
)

# Async
client = Scorable(api_key="your-api-key", run_async=True)
result = await client.judges.arun(
    judge_id="judge-id-here",
    request="INPUT to the LLM",
    response="OUTPUT from the LLM"
)

# Results are pydantic models
print(result.evaluator_results[0].score)          # float between 0 and 1
print(result.evaluator_results[0].justification)  # string
```

### TypeScript/JavaScript Example:

```typescript
import { Scorable } from "@root-signals/scorable";

const client = new Scorable({ apiKey: "your-api-key" });
const result = await client.judges.execute("judge-id-here", {
  request: "What is the refund policy?",
  response: "You can return items within 30 days.",
});

// result.evaluator_results[0].score
```

### Other Languages (cURL as template):

```bash
curl 'https://api.scorable.ai/v1/judges/{judge_id}/execute/' \
  -H 'authorization: Api-Key your-api-key' \
  -H 'content-type: application/json' \
  --data-raw '{"response":"LLM output here","request":"User input here"}'
```

### Integration Points:

- Insert evaluation code where LLM outputs are generated (for example, after an OpenAI Responses API call); see the sketch at the end of this step
- `response` parameter: the text you want to evaluate (required)
- `request` parameter: the input that prompted the response (optional but recommended)
- Use actual variables from your code, not static strings

### Result Format:

```json
{
  "evaluator_results": [
    {
      "evaluator_name": "Accuracy",
      "score": 0.85,
      "justification": "The response correctly identifies..."
    }
  ]
}
```

Each evaluator returns a score (0-1, higher is better) and a natural-language justification.
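As a concrete illustration, the sketch below wires evaluation into an OpenAI-based generation path and wraps the judge call so evaluation failures never break the main flow. The OpenAI model name, the `generate_and_evaluate` function, and the logging choices are assumptions for this example, not part of the Scorable API:

```python
import logging

from openai import OpenAI
from scorable import Scorable

# Sketch assumptions: the OpenAI Responses API as the generation step and a
# judge_id obtained in Step 3. Swap in your own LLM call and variables.
llm = OpenAI()
scorable_client = Scorable(api_key="your-api-key")  # prefer an environment variable in real code


def generate_and_evaluate(user_input: str, judge_id: str) -> str:
    completion = llm.responses.create(model="gpt-4o-mini", input=user_input)
    output_text = completion.output_text

    # Evaluate the output, but never let an evaluation failure break the application.
    try:
        evaluation = scorable_client.judges.run(
            judge_id=judge_id,
            request=user_input,
            response=output_text,
        )
        for item in evaluation.evaluator_results:
            logging.info("judge score=%.2f justification=%s", item.score, item.justification)
    except Exception:
        logging.exception("Scorable evaluation failed; continuing without scores")

    return output_text
```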
---

## Step 5: Provide Next Steps

After integration:

1. **Ask about additional judges**: If multiple stages were identified, ask whether the user wants to create judges for the other stages
2. **Discuss evaluation strategy**:
   - Should every LLM call be evaluated, or only a sample (e.g., 10%)?
   - Should scores be stored in a database for analysis?
   - Should specific actions trigger based on scores (e.g., alerts for low scores)?
   - Batch evaluation or real-time evaluation?
3. **Provide judge details**:
   - Judge URL: `https://scorable.ai/judge/{judge_id}`
   - If a temporary key was used, it must be included base64-encoded in the URL as a query parameter: `https://scorable.ai/judge/{judge_id}?token={base64-encoded temporary key}`
   - How to view results in the Scorable dashboard (https://scorable.ai/dashboard)
   - If a temporary key was used, note that it only works for a limited time and the user should create an account with a permanent key
4. **Link to the docs**: https://docs.scorable.ai

---

## Key Implementation Notes

- **Install the SDK first**: Check which dependency management system is used and install the appropriate package
- **Store API keys securely**: Use environment variables, not hardcoded strings
- **Handle errors gracefully**: Evaluation failures shouldn't break your application
- **Start simple**: Evaluate one stage first, then expand
- **Sample in production**: 5-10% sampling reduces costs while maintaining visibility

---

## Common Patterns

### Pattern 1: Development (100% evaluation)

```python
response = llm.generate(prompt)
eval_result = client.judges.run(judge_id, request=prompt, response=response)
log_evaluation(eval_result)
```

### Pattern 2: Production with Sampling (10% evaluation)

```python
response = llm.generate(prompt)
if random.random() < 0.1:  # 10% sampling
    eval_result = client.judges.run(judge_id, request=prompt, response=response)
    store_evaluation_in_db(eval_result)
```

### Pattern 3: Batch Evaluation

See https://docs.scorable.ai/usage/cookbooks/batch-evaluation
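The cookbook above covers batch evaluation in detail. As a rough sketch only, assuming the synchronous Python client from Step 4 and a hypothetical list of logged request/response pairs, the core loop looks like this:

```python
from scorable import Scorable

client = Scorable(api_key="your-api-key")

# Hypothetical batch of logged LLM interactions collected by your application.
logged_interactions = [
    {"request": "What is the refund policy?", "response": "You can return items within 30 days."},
    {"request": "Summarize this week's orders.", "response": "There were 42 orders totaling $3,100."},
]

results = []
for item in logged_interactions:
    result = client.judges.run(
        judge_id="judge-id-here",
        request=item["request"],
        response=item["response"],
    )
    results.append({
        "request": item["request"],
        "scores": [e.score for e in result.evaluator_results],
    })

print(results)
```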