Updated: 2026-04-07 By: Ari Heljakka
Short answer
Rule-based filters and LLM-powered moderation are not rivals, they are complementary tiers. Rule-based filters handle the bulk of obvious cases in milliseconds at near-zero cost. LLM moderators handle context, sarcasm, coded language, and policy nuance, at higher latency and higher cost. Most production systems run both, with rules in front and an LLM judge behind, escalating only ambiguous content.
Key facts
- Definition: Rule-based moderation uses deterministic patterns (regex, blocklists, perceptual hashes, classical classifiers) to flag content. LLM moderation uses a large language model, often prompted as a judge, to score content against a policy written in natural language.
- When to use: Rules for high-volume, latency-critical paths and clearly enumerable harms. LLMs for nuanced or context-dependent policies, multilingual surfaces, and evolving abuse patterns. Hybrid for almost everything else.
- Limitations: Rules are brittle to evasion and cannot judge intent. LLMs are slower, more expensive, and can be inconsistent without calibration. Neither is a substitute for human review on the most sensitive cases. Both need golden datasets and continuous regression testing to remain trustworthy in production.
- Example: A live chat platform runs regex and hash checks under 10 ms to filter the obvious 97% of messages, then sends the remaining 3% to an LLM judge that scores them against a written policy before allowing, blocking, or queuing them for human review. Both tiers are evaluated continuously against a golden dataset with regressions caught before deployment.
Key takeaways
- The right question is rarely "rules or LLM," it is "what fraction of traffic should each tier own."
- Rule-based filters are fast, cheap, auditable, and trivially explainable. They miss context and they break against evasion.
- LLM moderators are flexible, multilingual, and context-aware. They are slower, more expensive, and they need calibration and evaluation to stay trustworthy.
- A tiered pipeline (rules, lightweight classifier, LLM judge, human review) tends to win on cost, latency, and recall at the same time.
- Whatever you ship, treat the moderation system itself as an evaluable component: it needs golden datasets, regression tests, and ongoing scoring.
Definition
Rule-based filters are deterministic moderation systems built from explicit patterns: regular expressions, exact-match blocklists, perceptual hashes for known harmful media, and sometimes lightweight statistical classifiers such as logistic regression or fastText models. Output is binary or thresholded, and the rule that fired is itself the explanation.
LLM moderation uses a large language model, typically prompted as an LLM-as-a-judge, to read content and a policy and return a structured verdict, usually a score plus a written justification. The policy lives in a prompt, not in code, which is what makes the approach flexible. Variants include zero-shot prompted judges, fine-tuned safety classifiers, and small specialist models distilled from a frontier teacher.
Both fall under the umbrella of "guardrails" when they run inline in the request path, and under "evaluators" when they run asynchronously on traces. The same techniques work in either mode. The key insight is that a moderation system (whether rule-based, LLM-based, or hybrid) is itself an evaluable component with an explicit objective (the policy) and an implementation (the mechanism that enforces it). The objective and implementation must be decoupled and independently measurable: any moderation system must be scored, gated, and continuously monitored against ground truth regardless of how it is built.
When this matters
The choice between rules, LLMs, or a hybrid is forced by four pressures:
- Latency budget. Live chat, gaming, voice, and any synchronous user-facing surface typically have a sub-100 ms budget. Rules fit, frontier LLMs usually do not.
- Scale and cost. A platform moderating hundreds of millions of messages per day cannot afford to send all of them to an LLM. Rules absorb the cheap majority.
- Policy complexity. Idioms, sarcasm, coded language, multilingual abuse, and context-dependent harms exceed what pattern matching can capture. LLMs handle these.
- Auditability. Regulated surfaces and trust-and-safety reviews need an explanation for every decision. Rules produce trivial explanations; LLM judges produce written ones that themselves need evaluation.
If any one of those pressures dominates, the answer is usually obvious. When several conflict, a hybrid pipeline is the standard resolution.
How it works
Rule-based filters
A typical rule-based moderation stack chains several deterministic steps:
- Normalization. Lowercase, unicode-fold, strip zero-width characters, expand leetspeak, transliterate.
- Exact matching. Hashed blocklists for known-bad terms, URLs, file hashes (e.g. perceptual hashes for known CSAM or copyrighted images).
- Pattern matching. Regex over normalized text for slurs, doxxing patterns, payment scams, prompt-injection markers.
- Lightweight classifiers. Optionally a small ML model (logistic regression, fastText, distilled BERT-class) for categories where simple patterns underperform.
Every stage emits a label and a reason. Latency is dominated by I/O, not compute. Each stage should be independently evaluable against a subset of the golden dataset, with regressions caught before production deployment.
LLM moderation
A typical LLM judge pipeline:
- Policy prompt. A written description of what counts as a violation, with examples and edge cases. The policy itself is the objective, independent of implementation.
- Dimensional decomposition. Complex policies are often decomposed into independent violation dimensions (toxicity, harassment, policy_violation, etc.), each scored as a separate 0-1 metric. This allows composition and targeted optimization on specific failure modes.
- Structured output. The judge returns JSON:
. - Calibration. Judge thresholds are tuned against a labeled ground truth dataset; agreement with human raters is the primary quality gate. Calibration is a continuous process: recalibrate whenever the model, prompt, or live distribution changes.
- Sampling and routing. In production, only a fraction of traffic, often the ambiguous slice flagged by an upstream rule, reaches the judge.
The judge prompt is versioned, the ground truth dataset is versioned, and judge agreement is itself a metric tracked over time. Without that loop, an LLM moderator drifts silently when models change. The system must be model-agnostic: if the judge implementation swaps from one model to another, the objective and evaluation criteria remain stable.
Hybrid pipelines
The pattern that most large platforms converge on has four tiers:
- Tier 1, Rules: sub-10 ms, handles the obvious safe and obvious bad.
- Tier 2, Lightweight ML: 10 to 100 ms, handles known categories that resist regex.
- Tier 3, LLM judge: one to three seconds, handles context and nuance on the ambiguous slice. The judge may decompose the policy into independent dimensions (toxicity, harassment, policy violations) and score each 0-1, enabling fine-grained optimization.
- Tier 4, Human review: minutes to hours, handles the highest-stakes cases.
A well-tuned cascade can route the majority of traffic through Tier 1, escalate only a few percent to the LLM, and keep human reviewers focused on the ambiguous tail. Each tier is a distinct evaluator in the pipeline; the composition is itself an evaluable system with explicit quality gates and regression thresholds between tiers. The entire system—all tiers together—has a measurable objective (the policy) that is independent of the implementation details of any single tier.
Example
A live chat platform moderating ten million messages per day:
- Tier 1, Rules. Regex over normalized text catches obvious slurs and known scam URLs. Perceptual hash lookups catch known harmful images. Estimated cost per message is fractions of a cent; latency is single-digit milliseconds. Roughly the vast majority of messages clear or block at this tier.
- Tier 2, Lightweight classifier. A small classifier independently scores each residual message for toxicity, harassment, and self-harm indicators in tens of milliseconds. Each dimension is a separate 0-1 metric.
- Tier 3, LLM judge. The remaining ambiguous slice (idioms, sarcasm, coded harassment) is sent to a small instruction-tuned LLM acting as a judge against a written policy. Output is
, with each violation dimension scored independently. Latency one to three seconds; only the slice that needs it is paying that cost. - Tier 4, Human review. A small fraction is queued for human review with the LLM's justification attached, which itself becomes a labeled training example.
The result, qualitatively reported across multiple production write-ups, is dramatically lower inference cost than a full-LLM deployment, dramatically better recall than a rules-only deployment, and an evidence trail that survives a trust-and-safety audit.
Comparison
A criteria-driven view, with the wins distributed across both approaches:
| Criterion | Rule-based filters | LLM moderation |
|---|---|---|
| Latency | Wins. Sub-10 ms is typical. | Loses. Hundreds of milliseconds to seconds. |
| Per-decision cost | Wins. Fractions of a cent. | Loses. Cents to tens of cents, depending on model. |
| Recall on novel abuse | Loses. Misses anything not in the pattern list. | Wins. Generalizes to unseen phrasings. |
| Context and intent | Loses. Cannot distinguish "killed it" the compliment from "killed it" the threat. | Wins. Reads the surrounding turn. |
| Multilingual coverage | Loses without per-language rule packs. | Wins. Frontier models handle dozens of languages out of the box. |
| Evasion resistance | Loses to leetspeak, homoglyphs, novel slang unless normalization is exhaustive. | Wins on linguistic evasion, loses to adversarial prompt-injection-style attacks. |
| Explainability | Wins. The rule that fired is the reason. | Partial. Justification is a written paragraph that itself needs evaluation. |
| Auditability | Wins. Deterministic and versionable. | Partial. Reproducible only if prompt, model, and seed are pinned. |
| Policy iteration speed | Loses. New patterns require code or config changes. | Wins. Policy is a prompt; iteration is text editing. |
| Determinism | Wins. Same input, same output, always. | Loses. Non-deterministic without seeded sampling and a pinned model. |
| Scalability | Wins. Trivially horizontal. | Partial. Bounded by model throughput and budget. |
| Calibration burden | Wins. Rules either match or do not. | Loses. Needs labeled data, agreement metrics, and ongoing calibration. |
The pattern across rows is consistent: rules win on speed, cost, determinism, and auditability; LLMs win on coverage, nuance, multilingual reach, and policy iteration speed. The best production systems pay the cost of both only on the slice of traffic where each tier earns its keep.
When rules are the right answer alone
- Clearly enumerable harms (known terms, known hashes, known URLs).
- Sub-100 ms latency budgets that no LLM call fits.
- Extremely high volume on a strict cost ceiling.
- Regulatory surfaces where every decision must be reproducible from a versioned ruleset.
When LLMs are the right answer alone
- Low-volume, high-stakes review where latency is not a constraint and policy is complex.
- Multilingual surfaces where maintaining per-language rule packs is infeasible.
- Rapidly evolving abuse spaces where authors of patterns cannot keep up with attackers.
- Surfaces where a written justification is itself a product requirement (appeals, trust-and-safety reports).
Limitations
Both approaches have well-known soft spots, and a serious moderation program plans for them:
- Rules drift in the other direction. They do not get worse on their own, but they slowly stop covering the live distribution of abuse. The rule list rots if no one curates it. Rule decay is a signal that the moderation system has degraded and the golden dataset needs refreshing.
- Normalization is the silent failure mode of rules. Most "rule miss" incidents trace back to a unicode trick or a leetspeak variant the normalizer did not anticipate. Regression tests against the golden set catch these gaps before deployment.
- LLM judges drift with model updates. A judge calibrated on one model version can quietly slip when the provider ships a new checkpoint. Re-calibration is a recurring task. Judge drift is a signal that the moderation system itself has changed and must be re-evaluated.
- Judge agreement loss is a quality gate. When a judge's agreement with the golden dataset or with human raters drops below threshold, the system has failed its quality gate and a new checkpoint or policy is required before deployment.
- LLM judges are themselves attackable. Prompt-injection-style payloads embedded in content can subvert an LLM judge in ways that do not affect a regex.
- Neither replaces human review on the hardest cases. The highest-stakes content (CSAM, terrorism, imminent harm) belongs in a human queue regardless of what the automated tiers say.
- Cost compounds quietly. LLM moderation is cheap per call and ruinous per million. Without a sampling and routing strategy, moderation spend can exceed inference spend on the underlying product.
How to make either approach evaluable
Treat the moderation system itself (every tier and their composition) as a component under continuous evaluation. Both rule-based and LLM-based evaluators must maintain quality standards through ground truth datasets and regression gates. The system's objective (the policy) and its implementation (how the policy is enforced) must be independently measurable.
- Ground truth dataset. A versioned set of labeled examples, drawn from real production traces and red-team adversarial cases. This is the source of truth for both rule and judge evaluation. If using dimensional decomposition, label each example across all relevant violation dimensions (0-1 scores for toxicity, harassment, policy violations, etc.).
- Regression tests in CI. Any change to a rule, a classifier, or a judge prompt must clear the ground truth set before shipping. Failures block deployment.
- Tier-level gates. Each tier (rules, lightweight classifier, LLM judge) is scored independently against its portion of the ground truth set. Regressions in any tier trigger alerts.
- Dimensional measurement. If the judge decomposes the policy into independent dimensions, measure agreement on each dimension separately. A judge that drifts on "harassment" but not "toxicity" signals a specific problem requiring targeted recalibration, not a blanket retraining.
- Pipeline-level gates. The composed moderation system (all tiers together) must maintain agreement targets with a held-out test set. Degradation triggers re-evaluation or retraining.
- Sampled scoring in production. A fraction of decisions, allow and block alike, is re-scored by a stronger judge or a human, and disagreement is tracked over time.
- Model-agnostic monitoring. The quality gates and objectives must be independent of the specific model or implementation. If the judge swaps model versions, the same ground truth dataset and thresholds must still apply; drift is a signal to recalibrate, not rebuild.
- Drift dashboards. Plot allow rate, block rate, category mix, and tier-specific metrics per surface. Plot per-dimension agreement scores if decomposed. Sudden shifts are usually the first sign that a rule has rotted, a model has drifted, or the live distribution has changed.
- Appeal loop. User-driven appeals are a free source of high-quality labels for both the rule set and the judge prompt. Appeals are immediately added to the ground truth dataset and trigger re-evaluation.
This loop is the same loop used in eval-driven AI observability for any other production model behavior. Moderation is not a special case; it is an evaluable system like any other. The separation of objective (policy) from implementation (mechanism) is the key to building moderation systems that can evolve independently of any single technology.
Evidence and sources
- OpenAI Moderation API documentation, https://platform.openai.com/docs/guides/moderation, for the reference shape of a hosted LLM moderation endpoint.
- LLM Guard, https://protectapi.com/llm-guard, for an open-source example of a hybrid rules-plus-model guardrail stack.
- Production moderation write-ups from multiple platforms (latency, cost, and escalation figures cited above are reported in similar form across industry documentation, though not separately re-verified for this post).
Further numeric claims in this post (latency, cost, escalation rates) are stated qualitatively and should be re-measured on your own traffic before being used in planning.
Related reading
- How to Build Eval-Driven AI Observability for Agents
- LLM as a Judge vs. Human Evaluation
- How do you read and interpret LLM metrics?
FAQ
Is rule-based moderation obsolete now that LLMs can read context? No. Rules still win on latency, cost, determinism, and auditability. Most production moderation stacks keep rules as the first tier and use LLMs for the slice rules cannot judge.
Can an LLM judge fully replace a human reviewer? Not for the highest-stakes categories. LLM judges are excellent at scaling moderation and producing written justifications, but illegal content and imminent-harm cases belong in a human queue with proper safeguards.
How do I decide what fraction of traffic should reach the LLM? Start by measuring how often your rule tier returns "uncertain" or borderline scores. Send that slice to the LLM. Tune the escalation threshold until your cost budget and recall target both hold on a labeled validation set.
Do small LLMs work as judges, or do I need a frontier model? Small, instruction-tuned LLMs (including distilled or fine-tuned safety models) often match frontier models on narrow moderation rubrics at a fraction of the cost. The right size depends on policy complexity and on how well-calibrated your judge prompt is.
How often should I recalibrate an LLM moderator? At minimum, whenever the underlying model version changes, whenever the policy prompt changes, and on a fixed cadence (monthly or quarterly) against a refreshed ground truth set. Treat drift as a continuous process, not a one-time event. Make recalibration automatic: if agreement with the ground truth dataset drops below your quality gate on any dimension, the system should alert and block deployment until re-calibrated. Separate degradation per dimension (e.g., harassment agreement dropped but toxicity held steady) from blanket drift; it often signals a more targeted fix than a full retraining.
What about prompt-injection attacks against the LLM judge? Treat the judge's input as untrusted. Strip or escape control sequences, use a system prompt that explicitly subordinates user content, and run a deterministic pre-filter against known injection patterns before the LLM call. Injection attempts should be tracked as a separate metric and included in the golden dataset for continuous evaluation of judge robustness.