Updated: 2026-04-15 By: Ari Heljakka
Short answer
Programmatic rule evaluations are deterministic checks that score an LLM's output against explicit, codeable criteria: exact match, substring match, regex, schema validation, length, and surface-similarity metrics such as ROUGE (lexical overlap) and Levenshtein (edit distance). They are "dumb but efficient": fast, cheap, and reproducible, but fundamentally unable to capture the open-ended, unpredictable answers that LLM-based systems like chatbots actually produce. A rule can confirm that a response is valid JSON or contains a required disclaimer; it cannot tell you whether the answer was helpful, grounded, or even on-topic. That is the structural ceiling: rules cover the cases where "correct" is well-defined, leaving LLM judges to handle what matters about free-form language (intent, nuance, semantic equivalence, faithfulness). Rules remain the natural first tier in any evaluation stack precisely because they are cheap enough to run on everything. A defensible evaluation pipeline runs rules on 100 percent of traffic and judges on a sampled fraction, both reading from the same versioned ground-truth dataset.
Key facts
- Definition: Deterministic algorithmic checks that score LLM outputs against explicit criteria, with or without a reference answer.
- When to use: Format-driven outputs (JSON, code, structured extraction), tool-call generation, length constraints, banned-phrase enforcement, and anywhere "correct" can be codified.
- Limitations: Rules cannot capture semantic equivalence, paraphrase, intent, or context-dependent correctness. They are tier one, not the whole stack.
- Example: A structured extraction pipeline validates JSON schema, required fields, and exact field-value matches programmatically, then sends only ambiguous cases to an LLM judge for semantic adjudication.
Key takeaways
- Programmatic rules are the cheapest credible evaluator; if a check can be codified, code it.
- Rules with expected output (exact match, ROUGE, Levenshtein) compare against a reference; rules without expected output (regex, schema, length) validate structural properties.
- The right mental model is tiering: rules first, judges second, humans third. Each tier handles what the previous cannot.
- Rules are reproducible by construction; their version is the source file and the dataset they run against.
- Composing rules with semantic judges produces a scorecard that is both auditable and nuanced.
Definition
A programmatic rule evaluation runs an LLM output through a deterministic algorithm to verify that it satisfies a predefined criterion. The algorithm produces a normalized score (binary 0 or 1, or graded 0 to 1) without consulting a language model.
The rules split into two categories.
- Algorithms that require expected output. A reference answer or property exists, and the rule measures similarity or equality against it.
- Algorithms that do not require expected output. No reference is needed; the rule validates a structural property of the output itself.
Both categories are first-class evaluators: each has a version (the rule source plus its configuration), a dataset it runs against, and a score that composes into a per-dimension scorecard.
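One way to make "the rule source plus its configuration" concrete is to derive the version tag from a hash of that configuration, so any edit to the rule produces a new version automatically. The sketch below is illustrative only: `RuleSpec` and `rule_version` are hypothetical names, and a hand-assigned tag such as v2 would serve equally well.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RuleSpec:
    """Hypothetical container for one rule: a name, a kind, and its configuration."""
    name: str
    kind: str                      # e.g. "regex", "exact_match", "length"
    config: dict = field(default_factory=dict)


def rule_version(spec: RuleSpec) -> str:
    """Derive a short, stable version tag from the rule's kind and configuration."""
    payload = json.dumps(
        {"kind": spec.kind, "config": spec.config},
        sort_keys=True,            # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Illustrative rule: a case-insensitive banned-phrase regex.
banned = RuleSpec("no_guarantee", "regex", {"pattern": r"\bguarantee(d)?\b", "flags": "i"})
print(banned.name, rule_version(banned))
```

Recording this tag alongside every score is what lets two evaluation runs be compared later.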
When this matters
Programmatic rules earn their keep when at least one of the following holds.
- The output has a well-defined shape. JSON outputs, tool-call arguments, structured extraction, code generation. Schema validation alone catches a large fraction of failures cheaply.
- The latency or cost budget is tight. Rules run in milliseconds at near-zero cost; LLM judges do not.
- Auditability matters. Rules produce trivially explainable verdicts; a regex either matched or it did not. LLM judges produce written justifications that themselves need evaluation.
- The failure mode is enumerable. Banned phrases, required fields, format violations, length overruns. If you can list the failure mode, you can code it.
- You want a tier-one filter. Most production systems run rules on 100 percent of traffic and judges on a sampled fraction; rules carry the volume, judges handle the ambiguity.
How it works
A defensible programmatic evaluation tier has four components.
Component 1: Pick the rule type for the criterion
Rules with expected output measure similarity or equality against a reference.
- Exact match. Binary check: does the output equal the reference exactly? Strictest, useful for canonical answers (codes, IDs, deterministic labels).
- Substring match. Does the output contain the expected substring? Useful for factual claims where the wording is flexible but the fact is fixed.
- Levenshtein distance. Character-edit distance, normalized to a 0-to-1 similarity. Useful for near-match scoring when small variation is acceptable.
- ROUGE-1 and ROUGE-2. Word-overlap and bigram-overlap scores against a reference summary. Useful for content coverage in summarization.
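The four reference-based rules above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production scorer. The function names are hypothetical, the ROUGE variant shown is recall only with lowercased whitespace tokens (published ROUGE packages add options such as stemming), and normalizing edit distance by the longer string is one common choice among several.

```python
from collections import Counter


def exact_match(output: str, reference: str) -> float:
    return 1.0 if output == reference else 0.0


def substring_match(output: str, expected: str) -> float:
    return 1.0 if expected in output else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map distance to a 0..1 similarity by dividing by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def rouge_n_recall(output: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the output (clipped counts)."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, out_counts = ngrams(reference), ngrams(output)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, out_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


print(levenshtein_distance("kitten", "sitting"))  # 3
print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat", n=1))
print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat", n=2))
```

All four return a score in the 0-to-1 range, so they can share a scorecard without rescaling.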
Rules without expected output validate structural properties.
- Regular expressions. Pattern match for tool-call formats, citation formats, banned phrases.
- Schema validation. JSON schema or equivalent; required fields, types, enum constraints.
- Length count. Characters, words, tokens, sentences. Often a floor or ceiling rather than a target.
Each rule is one evaluator. Composing several rules gives a per-dimension scorecard.
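The structural rules above, and their composition into a scorecard, might look like the following sketch. The field names, currency list, banned phrases, and 50-word ceiling are invented for illustration. The hand-rolled `valid_structure` check stands in for a real JSON Schema validator, which would normally come from a dedicated library.

```python
import json
import re

BANNED = re.compile(r"\b(guaranteed|risk-free)\b", re.IGNORECASE)

# Illustrative required fields and enum; a real pipeline would load a schema file.
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float), "currency": str}
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


def no_banned_phrases(output: str) -> float:
    return 0.0 if BANNED.search(output) else 1.0


def within_word_limit(output: str, max_words: int = 50) -> float:
    return 1.0 if len(output.split()) <= max_words else 0.0


def valid_structure(output: str) -> float:
    """Parse the output as JSON, then check required fields, types, and the enum."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    for name, expected_type in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return 0.0
    return 1.0 if data["currency"] in ALLOWED_CURRENCIES else 0.0


RULES = {
    "schema": valid_structure,
    "banned_phrases": no_banned_phrases,
    "length": within_word_limit,
}


def scorecard(output: str) -> dict[str, float]:
    """One score per dimension; each rule is one evaluator."""
    return {dimension: rule(output) for dimension, rule in RULES.items()}


print(scorecard('{"invoice_number": "INV-1", "total": 12.5, "currency": "USD"}'))
print(scorecard("Returns are guaranteed."))
```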
Component 2: Source the expected output from a versioned ground-truth dataset
For rules that require a reference, the reference comes from a versioned ground-truth dataset: hand-labeled, with documented inter-rater agreement, split into a calibration set and a regression set. The dataset is the contract; rules read from it.
Without a versioned dataset, exact-match scores look precise but are not credible; they measure agreement with a moving target.
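A minimal sketch of what "versioned" can mean in practice: the version is a content hash over the labeled examples, and each example is assigned to the calibration or regression split by hashing its id, so membership stays stable between runs. `Example`, `dataset_version`, and `split_of` are hypothetical names, and the 20 percent calibration share is an assumption rather than a recommendation from this article.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    example_id: str
    input_text: str
    expected_output: str


def dataset_version(examples: list[Example]) -> str:
    """Content hash over examples sorted by id: any label edit changes the version."""
    canonical = json.dumps(
        [[e.example_id, e.input_text, e.expected_output]
         for e in sorted(examples, key=lambda e: e.example_id)]
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def split_of(example: Example, calibration_percent: int = 20) -> str:
    """Assign a split from a hash of the id, so membership is stable across runs."""
    digest = hashlib.sha256(example.example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "calibration" if bucket < calibration_percent else "regression"


examples = [
    Example("ex-001", "Invoice scan A", "INV-0042"),
    Example("ex-002", "Invoice scan B", "INV-0043"),
]
print(dataset_version(examples), [split_of(e) for e in examples])
```

If a label is corrected, the hash changes, and any score computed against the old labels is visibly tied to the old version.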
Component 3: Run the rules in two modes
The same rules run in two modes.
- Offline regression mode. Before deployment, the rules score the candidate output against the regression set. Failing rules block the deployment.
- Online mode. In production, the rules run on 100 percent of traffic at near-zero cost. Their output is logged and aggregated as a continuous quality signal.
Because rules are deterministic, the offline and online scores are directly comparable; divergence is itself a signal (usually that production inputs have drifted away from the regression set).
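The two modes can share the same rule function, which is what makes their scores comparable. The sketch below assumes a rule is any callable from output text to a 0-to-1 score. `offline_gate` and `OnlineMonitor` are hypothetical names, and the default threshold of 1.0 (every example must pass) is an illustrative choice.

```python
from statistics import mean
from typing import Callable

Rule = Callable[[str], float]


def offline_gate(rule: Rule, candidate_outputs: list[str], min_pass_rate: float = 1.0) -> bool:
    """Offline regression mode: True only if the mean score clears the threshold."""
    scores = [rule(output) for output in candidate_outputs]
    return bool(scores) and mean(scores) >= min_pass_rate


class OnlineMonitor:
    """Online mode: score every production output and keep running totals."""

    def __init__(self, rule: Rule) -> None:
        self.rule = rule
        self.count = 0
        self.total = 0.0

    def observe(self, output: str) -> float:
        score = self.rule(output)
        self.count += 1
        self.total += score
        return score

    @property
    def pass_rate(self) -> float:
        return self.total / self.count if self.count else 0.0


def is_short(output: str) -> float:
    """Illustrative rule: at most five words."""
    return 1.0 if len(output.split()) <= 5 else 0.0


print(offline_gate(is_short, ["ok", "still fine here"]))  # True: both outputs pass

monitor = OnlineMonitor(is_short)
monitor.observe("short answer")
monitor.observe("this answer runs well past the five word ceiling")
print(monitor.pass_rate)  # 0.5
```

A gap between the offline pass rate and `monitor.pass_rate` for the same rule version is the drift signal described above.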
Component 4: Tier rules with semantic judges
Rules cover the enumerable cases. LLM judges cover the semantic cases: paraphrase, intent, nuance, context-dependent correctness. The natural composition is:
- Rules run first, on 100 percent of traffic.
- LLM judges run second, on the sampled fraction that rules cannot adjudicate.
- Human review runs third, on the smallest fraction (sensitive, ambiguous, or escalated).
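A sketch of the first two tiers, under two assumptions that are not from the source: a rule signals "cannot adjudicate" by returning `None`, and the judge sample is chosen by hashing the request id so the same request is always in or out of the sample. The `judge` argument is a stand-in callable, not a real model client, and the human-review tier is omitted for brevity.

```python
import hashlib
from typing import Callable, Optional


def in_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: hash the request id into [0, 1) and compare to the rate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000 < sample_rate


def evaluate(
    request_id: str,
    output: str,
    rule: Callable[[str], Optional[float]],
    judge: Callable[[str], float],
    judge_sample_rate: float = 0.05,
) -> dict:
    """Tier 1 always runs. Tier 2 runs only if the rule abstains and the request is sampled."""
    rule_score = rule(output)
    if rule_score is not None:
        return {"tier": "rule", "score": rule_score}
    if in_sample(request_id, judge_sample_rate):
        return {"tier": "judge", "score": judge(output)}
    return {"tier": "unscored", "score": None}


def empty_output_rule(output: str) -> Optional[float]:
    """Illustrative rule: an empty output is a definite failure; otherwise abstain."""
    return 0.0 if not output.strip() else None


def stub_judge(output: str) -> float:
    """Placeholder for an LLM judge call; returns a fixed score for the demo."""
    return 0.7


print(evaluate("req-1", "", empty_output_rule, stub_judge))
print(evaluate("req-1", "Some free-form answer", empty_output_rule, stub_judge, judge_sample_rate=1.0))
```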
Each tier emits a normalized score on the same per-dimension scorecard. Each tier has a version, a cost, and a calibration agreement metric (for judges) or a precision-and-recall measurement (for rules).
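Both measurements mentioned above reduce to a confusion matrix over a labeled set. The sketch below computes precision and recall for a rule and the Matthews correlation coefficient, one possible agreement metric for a judge, from binary predictions and binary labels. The six-item example data is invented purely to show the arithmetic.

```python
from math import sqrt


def confusion(predicted: list[int], actual: list[int]) -> tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn), where 1 means 'flagged as a failure'."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    return tp, fp, fn, tn


def precision_recall(predicted: list[int], actual: list[int]) -> tuple[float, float]:
    tp, fp, fn, _ = confusion(predicted, actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def matthews_corrcoef(predicted: list[int], actual: list[int]) -> float:
    """MCC ranges from -1 to 1; 0 means no better than chance."""
    tp, fp, fn, tn = confusion(predicted, actual)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


rule_flags = [1, 1, 0, 0, 1, 0]    # what the rule (or judge) said
human_labels = [1, 0, 0, 0, 1, 1]  # what the ground-truth set says

print(precision_recall(rule_flags, human_labels))   # precision 2/3, recall 2/3
print(matthews_corrcoef(rule_flags, human_labels))  # 3/9, about 0.33
```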
Example
A team running a structured invoice extraction pipeline scores outputs across five dimensions: schema compliance, field accuracy, completeness, unit consistency, and refusal correctness on out-of-scope documents.
Schema compliance. JSON schema validation, deterministic, runs on 100 percent of traffic. Score is binary 0 or 1.
Field accuracy. Exact match against the labeled reference for ID fields (invoice number, customer ID). Substring match for free-text fields (customer name) when the labeled reference accepts variants. Levenshtein similarity for OCR-prone fields (addresses) with a 0.85 floor.
Completeness. Programmatic count of required fields present; binary 0 or 1 per required field; aggregate is the fraction present.
Unit consistency. Regex catches mixed currency or unit markers in numeric fields. Binary 0 or 1.
Refusal correctness. An LLM judge handles this dimension; rules cannot tell whether a refusal was appropriate for the input. Judge v1.5, calibrated against an 80-example ground-truth set, has a current Matthews correlation coefficient (MCC) of 0.69.
Result. Rules cover four of five dimensions at near-zero cost on 100 percent of traffic. The judge runs only on the fifth dimension, on a sampled fraction. Per-dimension scores compose into a scorecard tied to the tuple (extraction prompt v3.2, model v6.1, judge v1.5, dataset v4, schema v2).
When the schema or the labeled reference changes, the dataset version increments and rules are rerun against the new version. The scorecard remains comparable across changes because the evaluation contract is versioned.
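A compact sketch of three of the rule-based dimensions from this example, tied to the version tuple. The field names, sample values, and currency list are invented for illustration. The 0.85 floor is read here as "pass if similarity is at least 0.85", which is one interpretation of the text above.

```python
import re

REQUIRED = ("invoice_number", "customer_id", "customer_name", "address", "total")
CURRENCY = re.compile(r"USD|EUR|GBP|[$€£]")
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}


def completeness(extracted: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    present = sum(1 for name in REQUIRED if extracted.get(name) not in (None, ""))
    return present / len(REQUIRED)


def unit_consistency(amounts: list[str]) -> float:
    """0.0 if more than one distinct currency appears across the amount fields."""
    seen = {SYMBOL_TO_CODE.get(m, m) for amount in amounts for m in CURRENCY.findall(amount)}
    return 1.0 if len(seen) <= 1 else 0.0


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, one row at a time."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]


def address_accuracy(extracted: str, reference: str, floor: float = 0.85) -> float:
    """Binary pass/fail on normalized Levenshtein similarity against the floor."""
    longest = max(len(extracted), len(reference))
    similarity = 1.0 if longest == 0 else 1.0 - edit_distance(extracted, reference) / longest
    return 1.0 if similarity >= floor else 0.0


extracted = {"invoice_number": "INV-0042", "customer_id": "C-118", "customer_name": "",
             "address": "12 Main Sfreet, Springfield", "total": "$1,200.00"}
reference = {"invoice_number": "INV-0042", "address": "12 Main Street, Springfield"}

record = {
    "versions": {"prompt": "v3.2", "model": "v6.1", "judge": "v1.5", "dataset": "v4", "schema": "v2"},
    "scores": {
        "field_accuracy.invoice_number": float(extracted["invoice_number"] == reference["invoice_number"]),
        "field_accuracy.address": address_accuracy(extracted["address"], reference["address"]),
        "completeness": completeness(extracted),
        "unit_consistency": unit_consistency([extracted["total"]]),
    },
}
print(record)
```

Storing the version tuple next to the scores is what keeps two scorecards comparable after the schema or dataset changes.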
Limitations
- Rules miss semantic equivalence. "The total is $1,200" and "Twelve hundred dollars" mean the same thing, yet an exact-match or substring rule scores them differently. Use judges for paraphrase.
- Rules miss intent. A rule can detect that a refusal occurred; whether the refusal was correct for the input is a semantic judgment.
- Brittleness to minor variation. Exact match collapses under whitespace differences; normalize both the output and the reference before applying strict rules.
- Reference quality is the ceiling. If the labeled reference is wrong, the rule that matches it is also wrong. Treat the dataset as a versioned artifact with its own quality process.
- Composition across rules can double-count. Two rules that fire on the same failure mode count that failure twice and overstate the regression. Decompose by what fails independently.
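For the brittleness limitation above, a normalization pass before strict comparison is one mitigation. The sketch below applies Unicode normalization, case folding, and whitespace collapsing to both sides. Case folding is an assumption that suits free-text fields; it would be wrong for case-sensitive identifiers, so treat each step as optional per field.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, case-fold, and collapse whitespace before strict comparison."""
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space becomes a plain space
    text = text.casefold()
    return re.sub(r"\s+", " ", text).strip()


def normalized_exact_match(output: str, reference: str) -> float:
    return 1.0 if normalize(output) == normalize(reference) else 0.0


print(normalized_exact_match("  INV-0042\n", "inv-0042"))  # 1.0
print(normalized_exact_match("INV-0042", "INV-0043"))      # 0.0
```

The normalization function is part of the rule's configuration, so changing it should change the rule's version.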
Evidence and sources
- ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004), aclanthology.org/W04-1013
- Levenshtein distance for string similarity (original 1965 formulation), en.wikipedia.org/wiki/Levenshtein_distance
- Structured output validation in LLM pipelines, arxiv.org/abs/2310.10503
FAQ
When should I use programmatic rules instead of an LLM judge? Whenever the criterion can be codified. Schema, format, length, exact match, substring presence, regex patterns. Rules are cheaper, faster, and trivially auditable. Use a judge only for what rules cannot capture.
Can I run rules and judges on the same dimension? Yes, and you often should. A rule covers the strict case; a judge covers the semantic case. Both feed the same per-dimension scorecard.
What about ROUGE for summarization? ROUGE is a useful tier-one signal for content overlap. It is not a substitute for a faithfulness judge; outputs can ROUGE-match a reference and still hallucinate. Compose ROUGE with a faithfulness judge.
How do I version a rule? Treat the rule source (regex, schema, configuration) as a first-class artifact. Tag it with a version; record the version in every evaluation run. Changes to the rule are changes to the evaluation contract and warrant a deliberate update to the dataset and any downstream comparisons.
Are rules enough by themselves? For format-driven outputs with well-enumerated failure modes, often yes. For free-form generation, conversational outputs, or anything where semantic equivalence matters, no. Tier rules with judges.