Updated: 2026-03-11 By: Ari Heljakka
Short answer
LLMs are confident but not trustworthy. They generate fluent text from language patterns rather than from verified facts, so identical prompts can produce different outputs, false claims can read as confident ones, and the same model that summarizes a contract well will compute a tax liability badly. The reliable pattern is hybrid: put the LLM where its strength matters (parsing unstructured input, producing natural language, suggesting candidates) and put a deterministic system where reliability matters (calculation, schema validation, policy enforcement, irreversible action). The LLM proposes; the deterministic system disposes; the evaluation layer measures the seam between them.
Key facts
- Definition: A hybrid LLM-plus-deterministic system uses the LLM as a fuzzy front-end or back-end for tasks where natural language is the input or output, and routes high-stakes operations through deterministic code that can be audited and tested.
- When to use: Any production system where wrong answers have material cost: financial, clinical, legal, safety-critical, irreversible.
- Limitations: Designing the seam is harder than building either side alone; the deterministic side may need to grow as the LLM's failure modes become visible.
- Example: A document automation system uses an LLM to extract fields from invoices, then routes each extracted total through a deterministic ledger that re-computes and reconciles before writing to the accounting system.
Key takeaways
- LLMs are not reliable components. They are fluent, useful, and probabilistic. Treat them that way.
- Reliability is a system property, not a model property. A reliable system contains an unreliable model bounded by deterministic checks.
- The seam between LLM and deterministic system is the highest-value design surface in the architecture.
- Evaluation lives at the seam. Every LLM output that flows into deterministic code is a measurable event.
- The reliable pattern scales because the deterministic side is testable, versionable, and auditable in the ordinary software-engineering sense.
Definition
An LLM is a fuzzy component: its outputs are samples from a distribution, not deterministic computations. The same input may yield different outputs on different runs. The model has no internal mechanism to verify a fact; it generates the most plausible-sounding next tokens given its training.
A deterministic system is software whose output is a function of its input: given the same input, you get the same output. Schema validators, arithmetic libraries, rule engines, type checkers, and database transactions are all deterministic.
A hybrid reliable system routes work so that each component does what it is good at:
- The LLM handles unstructured input, generates natural language, suggests candidate answers, and interfaces with humans.
- The deterministic system computes, validates, enforces, and commits.
- The seam between the two carries structured handoffs (typed payloads, not free text) and is the locus of evaluation.
Reliability is the property of the whole system. It does not require a reliable LLM; it requires a system designed so the LLM's failure modes are bounded.
When this matters
The hybrid pattern earns its keep when at least one of these holds:
- The cost of a wrong answer is high (a financial penalty, a clinical error, a legal exposure, a user-visible failure that erodes trust).
- The output must be reproducible (an audit trail requires that the same input always produces the same answer).
- The output must be verifiable (a downstream system or a regulator needs to check the answer against a ground truth).
- The operation is irreversible (sending money, prescribing medication, dispatching a vehicle).
- The system is required to behave consistently across a population of similar inputs.
A creative writing assistant or a brainstorming partner has none of these. The hybrid pattern is overhead for those use cases.
How it works
The hybrid pattern has three structural pieces: the LLM's role, the deterministic system's role, and the seam between them.
Where the LLM goes
LLMs are good at:
- Parsing unstructured input. Pulling a date, a total, an address out of a free-text invoice; extracting intent from a user's question.
- Producing natural language. Drafting a response, summarizing a document, explaining a calculation to a human.
- Suggesting candidate answers. Proposing a tool to call, a query to run, an action to take, for the deterministic system to verify.
- Bridging modalities. Converting a voice request into a structured command; turning a structured event into a human-readable notification.
In each case the LLM's output is a candidate, not a commitment. Something else verifies.
Where the deterministic system goes
Deterministic systems are good at:
- Calculation. Tax math, inventory math, statistical aggregation, anything where two plus two must equal four every time.
- Validation. Schema checks, business-rule enforcement, policy checks against a versioned rule set.
- Action. Database transactions, API calls that change state, anything where reproducibility and auditability matter.
- Verification. Cross-checking an LLM-proposed answer against authoritative data, recomputing a value, looking up a fact in a system of record.
The deterministic side does not need to be clever. It needs to be correct, fast, and auditable.
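As a minimal sketch of the calculation and verification roles, the snippet below recomputes a tax amount with Python's decimal module and accepts an LLM-proposed figure only when it matches the recomputed value. The function names, the 24% rate, and the half-up rounding rule are illustrative assumptions, not part of any particular system.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP

CENT = Decimal("0.01")


def compute_tax(net_amount: Decimal, rate: Decimal) -> Decimal:
    """Deterministic calculation: the same inputs always give the same result."""
    # Rounding rule is an assumption; real tax rules specify their own.
    return (net_amount * rate).quantize(CENT, rounding=ROUND_HALF_UP)


def verify_proposed_tax(proposed: str, net_amount: Decimal, rate: Decimal) -> bool:
    """Accept the LLM's proposed value only if it equals the recomputed one."""
    try:
        candidate = Decimal(proposed)
    except InvalidOperation:
        # Free text such as "about 48" is not a number and is never accepted.
        return False
    return candidate == compute_tax(net_amount, rate)


if __name__ == "__main__":
    net, rate = Decimal("199.99"), Decimal("0.24")  # hypothetical inputs
    print(compute_tax(net, rate))
    print(verify_proposed_tax("47.99", net, rate))
```

The verifier never tries to repair the proposal. It either confirms the candidate against its own computation or rejects it, which keeps the check easy to audit.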
The seam: structured handoff and continuous evaluation
The seam is where reliability is won or lost. Two rules:
- Handoffs are typed. The LLM emits structured payloads (JSON, function calls, tool arguments) into the deterministic system. The deterministic system validates the payload before doing anything else.
- The seam is measured. Every handoff is a measurable event. Was the schema valid? Did the deterministic check accept the proposed value? Did the human review override the LLM's suggestion?
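One way to make the first rule concrete is sketched below, using only the Python standard library. The payload shape (`vendor`, `invoice_number`, `total`), the `InvoiceTotal` and `HandoffError` names, and the choice to carry the total as a decimal string are all assumptions made for illustration. A production system might use a schema library instead.

```python
import json
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class InvoiceTotal:
    vendor: str
    invoice_number: str
    total: Decimal


class HandoffError(ValueError):
    """Raised when an LLM payload fails validation at the seam."""


def parse_handoff(raw: str) -> InvoiceTotal:
    """Validate raw LLM output before any other code touches it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise HandoffError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise HandoffError("payload must be a JSON object")
    missing = [k for k in ("vendor", "invoice_number", "total") if k not in data]
    if missing:
        raise HandoffError(f"missing fields: {missing}")
    for key in ("vendor", "invoice_number"):
        if not isinstance(data[key], str) or not data[key].strip():
            raise HandoffError(f"{key} must be a non-empty string")
    # Assumption: totals travel as strings so no float rounding enters the seam.
    if not isinstance(data["total"], str):
        raise HandoffError("total must be a decimal string")
    try:
        total = Decimal(data["total"])
    except InvalidOperation as exc:
        raise HandoffError("total is not a decimal number") from exc
    if not total.is_finite() or total < 0:
        raise HandoffError("total must be a finite, non-negative amount")
    return InvoiceTotal(data["vendor"], data["invoice_number"], total)
```

Each raised `HandoffError` is also a data point for the second rule: counting them over time gives a first-attempt schema-validity rate.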
Evaluation at the seam produces concrete dimensions:
- Schema validity. What fraction of LLM outputs pass schema validation on first attempt?
- Verification pass rate. What fraction of LLM proposals are accepted by the deterministic verifier?
- Override rate. What fraction of LLM outputs are corrected by a human or by a fallback rule?
- End-to-end correctness. What fraction of the full hybrid pipeline produces the right answer?
These scores are versioned. They feed CI/CD gates the same way unit-test pass rates do.
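The four dimensions above could be computed from logged handoff events roughly as follows. This is a sketch under stated assumptions: the `HandoffEvent` record, the choice of denominators (for example, measuring verification pass rate only over schema-valid payloads), and the threshold values are hypothetical and would need to match how a given team logs its seam.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandoffEvent:
    schema_valid: bool
    verifier_accepted: bool
    human_overrode: bool
    correct: Optional[bool] = None  # known only for ground-truth items


def _rate(flags: list) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def seam_metrics(events: list) -> dict:
    valid = [e for e in events if e.schema_valid]
    labeled = [e for e in events if e.correct is not None]
    return {
        "schema_validity": _rate([e.schema_valid for e in events]),
        # Assumption: only schema-valid payloads ever reach the verifier.
        "verification_pass_rate": _rate([e.verifier_accepted for e in valid]),
        "override_rate": _rate([e.human_overrode for e in events]),
        "end_to_end_correctness": _rate([e.correct for e in labeled]),
    }


def gate_failures(metrics: dict, minimums: dict, maximums: dict) -> list:
    """Return the names of metrics that violate a floor or a ceiling."""
    failures = [name for name, floor in minimums.items() if metrics[name] < floor]
    failures += [name for name, cap in maximums.items() if metrics[name] > cap]
    return failures
```

A CI job could call `gate_failures` with thresholds stored alongside the prompt and model version, and fail the build when the returned list is non-empty.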
Example
A team automates accounts-payable processing. The naive design uses an LLM end to end: extract the invoice, decide whether to pay, post the transaction. The hybrid redesign separates concerns:
- LLM, parsing. The model extracts vendor, invoice number, line items, total, and due date from the PDF. Output is a typed payload.
- Deterministic, validation. A schema validator rejects payloads missing required fields or with arithmetic that does not match (line items must sum to total within a tolerance).
- Deterministic, verification. A reconciliation step compares the extracted vendor and invoice number against the purchase order system. Mismatches go to human review.
- LLM, drafting. For invoices needing human review, the LLM drafts a summary explanation for the AP clerk.
- Deterministic, action. The transaction is posted to the accounting system through an audited API.
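The deterministic steps in this redesign can be sketched as a single routing function. Everything named here is an assumption for illustration: the `Extraction` record, the one-cent tolerance, and the in-memory `purchase_orders` dictionary standing in for a real purchase order system.

```python
from dataclasses import dataclass
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed rounding tolerance


@dataclass(frozen=True)
class Extraction:
    vendor: str
    invoice_number: str
    line_items: tuple
    total: Decimal


def route(extraction: Extraction, purchase_orders: dict) -> str:
    """Return 'reject', 'review', or 'post' for one extracted invoice."""
    # Validation: line items must sum to the stated total within tolerance.
    items_sum = sum(extraction.line_items, Decimal("0"))
    if abs(items_sum - extraction.total) > TOLERANCE:
        return "reject"
    # Verification: the invoice number must be known and tied to the same vendor.
    if purchase_orders.get(extraction.invoice_number) != extraction.vendor:
        return "review"
    # Action: only fully checked invoices are eligible for posting.
    return "post"


if __name__ == "__main__":
    pos = {"INV-1": "Acme"}  # hypothetical purchase order lookup
    invoice = Extraction(
        "Acme", "INV-1", (Decimal("100.00"), Decimal("20.50")), Decimal("120.50")
    )
    print(route(invoice, pos))
```

The LLM never appears in this function. It supplies the `Extraction` and, for the review branch, drafts the explanation for the clerk. The decision itself stays in code that can be unit-tested.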
The evaluation framework scores the seam at every step:
- Schema-validity rate on LLM extractions: 0.94 (4 dimensions, weekly).
- Reconciliation match rate: 0.91. Mismatches cluster around two vendors with non-standard invoice formats; the failure cluster is tracked and the prompt is refined.
- Human-override rate on AP-clerk reviews: 0.06. Upward drift triggers an alert; when the rate did drift up, the cause was a model swap, which the gate caught and rolled back.
- End-to-end correctness on a held-out ground-truth dataset of 200 invoices: 0.98.
The LLM is not reliable. The system is.
Limitations
Caveats worth flagging up front:
- The deterministic side grows over time. Every new failure mode the LLM exposes is a new check on the deterministic side. Budget for this growth.
- Schema design is hard. A schema too loose lets bad outputs through; a schema too tight forces the LLM into brittle outputs. Iterate the schema against real data.
- The seam can hide complexity. Pushing intelligence into the deterministic side trades model brittleness for code brittleness. Both are testable; only one is fashionable.
- Cross-modal handoffs need care. When the LLM consumes structured input and emits structured output, prompt engineering carries more weight than people expect.
- Continuous monitoring is required. The hybrid pattern is reliable only if the seam metrics are tracked. A drift on schema-validity rate is a leading indicator of a regression.
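A drift check on schema-validity rate can be as simple as a rolling window compared against a baseline. The sketch below assumes a hypothetical `ValidityMonitor` class, a fixed window size, and a fixed allowed drop. Real deployments would likely add statistical tests and alert routing.

```python
from collections import deque


class ValidityMonitor:
    """Flags when the rolling schema-validity rate falls below a baseline."""

    def __init__(self, baseline: float, window: int = 200, max_drop: float = 0.03):
        self.baseline = baseline
        self.max_drop = max_drop
        self.results = deque(maxlen=window)  # keeps only the newest `window` results

    def record(self, schema_valid: bool) -> bool:
        """Record one handoff; return True if the window has drifted."""
        self.results.append(schema_valid)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.results) / len(self.results)
        return self.baseline - rate > self.max_drop


if __name__ == "__main__":
    # Baseline and window are illustrative values, not recommendations.
    monitor = ValidityMonitor(baseline=0.94, window=10)
    for outcome in [True] * 10 + [False]:
        alert = monitor.record(outcome)
    print("alert:", alert)
```

Because the window is bounded, a short burst of failures shows up quickly, which is what makes the metric useful as a leading indicator.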
Evidence and sources
- Anthropic, Reducing hallucinations in production, https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/reduce-hallucinations, for the disciplined-handoff and verification pattern.
- OpenAI, Function calling and tool use, https://platform.openai.com/docs/guides/function-calling, for the structured-handoff schema between an LLM and downstream deterministic code.
- Inavate, AI reliability challenges, ISE 2026, https://www.youtube.com/watch?v=1kgf97lYU7E, for the "confident but not trustworthy" framing of LLM outputs.
FAQ
Is this just "use rules where rules apply"? The hybrid pattern is broader. Rules are one kind of deterministic component; calculators, validators, ledgers, and lookups are others. The point is to place the deterministic boundary at the right granularity, not to avoid LLMs.
Won't this slow my system down? Schema validation, reconciliation, and verification typically add latency in the tens of milliseconds, not the hundreds, provided the checks run against local or fast systems of record. The LLM call is almost always the dominant latency component. The hybrid pattern's overhead is small.
Where do agents fit in this picture? A reliable agent is a hybrid system at every step. Each tool call is a structured handoff to a deterministic system; each tool response is a structured handoff back. The agent's reliability is the product of the seam metrics at every step.
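To see why per-step seam metrics matter so much for agents, here is a small worked calculation. It assumes each step succeeds independently of the others, which is a simplification, and the 0.98 per-step rate is an illustrative figure rather than a measurement.

```python
import math


def chain_success(step_rates: list) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(step_rates)


if __name__ == "__main__":
    # Ten tool calls, each with an assumed 0.98 pass rate at its seam.
    print(round(chain_success([0.98] * 10), 3))
```

Under these assumptions a ten-step agent with 0.98 per step completes correctly only about 82% of the time, so small improvements at each seam compound across the run.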
How do I justify the up-front design cost? The cost shows up in regressions you do not have, audits you can pass, and incidents you do not need to debug. Track schema-validity rate, override rate, and end-to-end correctness as leading indicators.
When is end-to-end LLM acceptable? When the cost of a wrong answer is low, the output is consumed by a human who will catch errors, and the system is not on a path to handle anything riskier. The right moment to introduce the hybrid pattern is before that path is taken, not after.