Updated: 2026-05-15 By: Ari Heljakka
The tedious problem
Here is what operating production AI in a regulated environment looks like right now, in most organizations:
- A compliance analyst manually reviews a sample of chatbot transcripts every two weeks. The sample is small because reviewing at scale is expensive and slow.
- The criteria drift quietly from sprint to sprint because the evaluation rubric lives in a shared document that engineers update when something breaks.
- After an incident, the team scrambles to reconstruct what the system was doing at the moment of failure, working backward through logs that were not designed to answer audit questions.
- A board or regulator asks for evidence that the AI system behaves within policy, and the answer is a slide deck assembled in four days that nobody is fully confident in.
Meanwhile, the automated test suite passes. Latency is fine. Error rates are within SLA. The system looks healthy by every operational metric, and nobody has a credible answer to the question: "Is it actually doing what we said it would do?"
This is not just a tooling problem. Better evaluation tooling helps, but it does not close the gap, because the team that builds the better tooling is the same team that chooses which behaviors to test, which thresholds to set, and what counts as a passing score. A tighter version of that process still inherits the same incentives. The gap is not that the engineering team is careless or dishonest. It is that the engineering team cannot independently verify its own choices. That is a structural problem, and tooling alone cannot solve a structural problem.
What production AI is missing
Organizations that run production systems have had to invent new functions before, when old structures stopped working, and that instinct is the right one here. The analogy we will lean on is an old and well-tested one: financial audit.
A company's accounting team is skilled, well-intentioned, and deeply familiar with the numbers. They still cannot audit themselves. The reason is not moral; it is procedural. The team that produces a set of accounts has an inherent conflict of interest in evaluating whether those accounts are accurate. The audit function exists to close that conflict by design, not by goodwill.
The same logic applies to AI systems. The engineering team that selected the model, designed the prompt scaffolding, chose the evaluation thresholds, and built the test suite cannot independently verify that those choices produce trustworthy outputs. The problem here is not primarily one of intent; it is one of orientation. Engineers building an AI system face two structural blind spots:
- They are not strongly motivated to obsess about issues that could arise from user inputs never represented in the original test sets. Their incentive is to ship something that handles the cases they have already enumerated. The long tail of inputs the actual user base will produce is not on their dashboard, and there is no natural pressure to go looking for it.
- They are not naturally inclined to use the scoring rubric as a tool for maximizing user value. They use it as a tool for avoiding failure. Those are not the same thing. A rubric oriented around "did this output fail" answers a different question from "did this output give the user what they actually came for."
The gap is not competence; it is structure. And the structure requires a function that sits outside the engineering organization, watches what the system actually does in production, and produces a record the people who built the system cannot quietly edit.
Call this function the AI Auditor.
What an AI Auditor does
The AI Auditor function has five components. Together they form something that does not exist in most organizations today: continuous, independent, evidence-grade evaluation of production AI.
- It watches production traffic. Not a sample reviewed retrospectively in a spreadsheet, but a continuous, significant sample of requests, responses, and intermediate steps, captured with full trace metadata and instrumented at the runtime or proxy level. The observation layer is not optional; you cannot audit what you cannot see.
- It scores against criteria that cannot silently change. This is the structural control that makes evaluation credible. The scoring criteria, the judge definitions, and the thresholds live in a separate namespace, access-controlled so that every change to the evaluation criteria is tracked. An edit to the rubric is surfaced explicitly and produces its own versioned entry in the audit log. Policy-based thresholds for error tolerance are explicit, first-class settings, kept distinct from the technical thresholds that change all the time.
- It produces a record a third party can use. A dashboard tells you what the system is doing right now. A durable record tells a third party what the system was doing on a specific date, evaluated against a specific version of the rubric, with a specific outcome, and it does so in a form that cannot be retroactively adjusted. This is the deliverable that survives a board review, a regulatory examination, or a procurement due-diligence process.
- It is built to expose not just errors but also weak business delivery. The auditor signal itself becomes an optimization target for value creation, not just risk minimization.
- Its outputs are executive-readable, not engineering-readable. The AI Auditor function translates continuous technical evaluation into findings classified by severity, remediation status, and trend. A Chief Risk Officer reading an AI audit report should be able to understand what was found, what was done about it, and whether the situation improved, without access to the underlying trace data.
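To make "criteria that cannot silently change" and "a record that cannot be retroactively adjusted" concrete, here is a minimal sketch of one way such a record could be kept: an append-only log in which every entry is hashed together with the entry before it. This is an illustration under stated assumptions, not a description of any particular product. The names `AuditLog`, `rubric_version`, `policy_max_error_rate`, and the trace ID are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(record: dict, prev_hash: str) -> str:
    """Hash a record together with the hash of the entry before it."""
    body = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    """Hypothetical append-only log; each entry commits to all earlier ones."""
    entries: list = field(default_factory=list)

    def append(self, kind: str, payload: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        record = {"kind": kind, "payload": payload}
        entry = {**record, "prev": prev_hash, "hash": _digest(record, prev_hash)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; a retroactive edit to any entry breaks it."""
        prev_hash = ""
        for entry in self.entries:
            record = {"kind": entry["kind"], "payload": entry["payload"]}
            if entry["prev"] != prev_hash or entry["hash"] != _digest(record, prev_hash):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
# A rubric change is itself a logged, versioned event (values are illustrative).
log.append("rubric_change", {"rubric_version": 1, "policy_max_error_rate": 0.02})
# Each evaluation records which rubric version it was scored against.
log.append("evaluation", {"trace_id": "t-001", "rubric_version": 1, "passed": True})
log.append("rubric_change", {"rubric_version": 2, "policy_max_error_rate": 0.01})

print(log.verify())  # True while no entry has been altered
```

A hash chain only makes tampering detectable. It does not by itself prevent someone with write access from rebuilding the whole chain, which is why the access control and separate namespace described above still matter.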
What it is not
The AI Auditor function is distinct from several things organizations may already have:
- Not what a monitoring vendor sells. Observability-oriented tools like LangSmith or Braintrust can help engineering teams build better AI. They surface failure modes, help with prompt iteration, and make debugging faster. They are engineering tools, built for the engineering team, with the engineering team in control of every configuration. That is their purpose; it is not a criticism. But engineering observability is not the same as independent evaluation, for exactly the reason described above.
- Not what the Big Four sell, though it is the natural input to what they sell. Consulting and audit firms can assess AI governance frameworks, provide opinions on model risk, and produce third-party attestation. For that work to be credible, they need documentation produced continuously and maintained independently of the team being assessed. The AI Auditor function is what produces that documentation.
- Not a product pretending to be a statutory authority. Scorable is not a regulator or legal certifier. Its AI Auditor provides procedural, not institutional, independence: controlled access, auditable rubric changes, and externally reviewable methodology. Unlike a Big Four opinion, it does not transfer liability. It does, however, structurally prevent teams from changing the criteria by which they are evaluated.
What changes when the role exists
The absence of an AI Auditor function produces a particular kind of organizational dysfunction: accountability theater. The engineering team knows the system has gaps; they fix what they can find. Compliance asks whether the system is safe; engineering says yes, with caveats. Nobody is lying, but nobody is producing the kind of structured, independent record that would let a third party reach an informed conclusion.
When the function exists, several things change:
- Incident response becomes faster. When a failure occurs, the investigation starts from a complete, pre-existing evidence trail rather than a scramble to reconstruct what happened. The AI Auditor already knows what was evaluated, when, against what criteria, and what the outcome was. Root-cause analysis moves from days to hours.
- Remediation closes the loop. Finding a problem is necessary but not sufficient. An AI Auditor function tracks whether remediation actually worked: the same judge that identified the failure evaluates the fix, and the outcome is recorded in the same evidence stream. You can demonstrate not only that you identified a problem but that you fixed it, and that the fix held.
- Accountability becomes defensible. When a regulator or board asks whether the AI system was behaving within policy on a specific date, the answer is not a presentation prepared in the week before the meeting. It is a versioned record produced continuously and available on request.
- Manual review shrinks to where it adds value. Human review of AI outputs is valuable at the margins: novel failure modes the automated judges were not designed to catch, cases that require judgment about regulatory intent rather than technical criteria. When continuous automated evaluation handles the volume, human reviewers can focus on the cases where human judgment is genuinely irreplaceable.
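The remediation loop in particular lends itself to a small sketch: the same judge that flagged a failure re-scores the fix, and both results land in one evidence stream. Everything here is hypothetical and chosen for illustration, including the `Evidence` fields, the finding ID, and the toy string-matching judge. A real judge would more plausibly be a model-based or rule-based evaluator defined in the access-controlled rubric.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Evidence:
    finding_id: str
    stage: str           # "passed", "detected", "fix_verified" or "fix_failed"
    judge: str           # name of the judge that produced the score
    rubric_version: int
    score: float


def judge_states_refund_window(output: str) -> float:
    """Toy judge (assumption): 1.0 if the reply states the refund window."""
    return 1.0 if "30 days" in output else 0.0


def evaluate(stream: List[Evidence], finding_id: str,
             judge: Callable[[str], float], output: str,
             rubric_version: int, threshold: float,
             is_fix: bool = False) -> Evidence:
    """Score one output and append the result to the shared evidence stream."""
    score = judge(output)
    passed = score >= threshold
    if is_fix:
        stage = "fix_verified" if passed else "fix_failed"
    else:
        stage = "passed" if passed else "detected"
    evidence = Evidence(finding_id, stage, judge.__name__, rubric_version, score)
    stream.append(evidence)
    return evidence


stream: List[Evidence] = []
# Original production output: the judge flags it.
evaluate(stream, "F-12", judge_states_refund_window,
         "Refunds are handled case by case.", rubric_version=2, threshold=1.0)
# Output after the fix: the same judge, same rubric version, same stream.
evaluate(stream, "F-12", judge_states_refund_window,
         "You can request a refund within 30 days.", rubric_version=2,
         threshold=1.0, is_fix=True)

print([e.stage for e in stream])  # ['detected', 'fix_verified']
```

Because the detection and the verification share a finding ID, a judge, and a rubric version, a reader of the stream can see that the fix was checked against the same criteria that exposed the problem.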
The regulatory moment
The external pressure is arriving at the same time as the internal need. The regulatory frameworks differ in emphasis, but they share a common requirement: records that someone other than the team that built the system can use to reach an independent conclusion.
The UK FCA's Consumer Duty requires evidence of outcomes. Firms must show that products and services are actually delivering good outcomes for customers, not just that the product was designed with good intentions. An AI system that advises, recommends, or decides on behalf of customers needs continuous documentation of what it actually did, not a retrospective claim.
The EU AI Act, for high-risk applications, sets out requirements for technical logging and traceability (Article 12), alongside post-market monitoring and record-keeping obligations, and oversight measures where explicitly required. Records that exist only on paper, or that the development team alone controls, do not meet the intent of the requirement.
MiFID II treats AI-assisted investment advice the same as human advice for record-keeping purposes. The evidence trail has to be equivalent, which means it has to be continuous, versioned, and available for examination.
The common thread is not that these frameworks all require a third-party auditor. The common thread is that they all require records a third party can actually use. The AI Auditor function is what makes that possible.
Where we go from here
The AI Auditor is not a role that replaces what engineering teams do. It is a role that makes engineering-team evidence credible to people outside the engineering team. That distinction matters because most of the pressure on organizations deploying AI right now comes from people outside the engineering team: boards, regulators, risk committees, procurement reviewers.
Building this function is not simple, and it is not cheap. It requires instrumentation, access control, change-management process, and the discipline to maintain a rubric even when it surfaces uncomfortable findings. What it produces, when it works, is something genuinely new in most organizations: the ability to say with evidence, not just assertion, that your AI system is doing what you said it would do.