Ensuring the Safety of Healthcare AI with LLM Judges

Healthcare is a unique business with unusually high stakes. Recommending the wrong treatment, relying on missing or mismatched documentation, or presenting doctors with hallucinated information hurts not only profits but also patients. Stakes this high create a need for new solutions, and for new companies to build them.

The client

Gosta Labs is a Finnish company dedicated to building robust and secure machine learning models for healthcare. Its offerings include Gosta Aide, which streamlines and supports administrative tasks; Gosta Life Sciences, which focuses on applications in the life sciences sector; and Ambient Clinical Documentation, an AI-powered assistant that creates clinical notes in real time during patient appointments.

Taking notes during a medical consultation is a perfect use case for generative AI and large language models. Instead of typing while the patient is present, physicians can focus entirely on care while the AI assistant records details in the background. Once the consultation ends, the system automatically generates a clinical note for review or modification.

"With our technology, up to 70% of time can be saved, freeing resources for better treatment or caring for more patients," explains Henri Viertolahti, Chief Product Officer at Gosta Labs.

The challenge

Healthcare is a sector where accuracy and safety are paramount. Advising incorrect treatments, overlooking critical documentation, or generating hallucinated information doesn't just impact revenue—it endangers patients' lives. With stricter regulations, including the upcoming EU AI Act, healthcare providers must ensure compliance and reliability at scale.

As Henri Viertolahti notes, "Human testers are important, but to build products in a scalable and sustainable way, automated testing is the only viable path."

The solution

To address these challenges, Gosta Labs integrates the Scorable platform into its development workflow. Scorable enables automated testing of every model iteration, with evaluators that detect errors and hallucinations and measure qualities such as conciseness. This ensures that each new clinical note model or healthcare AI application is validated before deployment.

"With Scorable, we can compare new model versions against existing baselines, refine outputs, and make sure every change actually improves performance," says Viertolahti.

About Scorable

Scorable is a platform that empowers organizations across industries to measure, evaluate, and control large language model behavior through automated, scalable evaluators. Its mission is to make AI more reliable by detecting hallucinations, ensuring consistency, and continuously benchmarking model performance in production environments.

A key innovation from Scorable is Root Judge — a specialized LLM designed to evaluate other LLMs. It can identify unsupported or fabricated outputs, flag risks, and provide transparent justifications for its assessments. This helps companies deploy generative AI with greater safety and accountability.
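To make the judge idea concrete, here is a minimal, self-contained sketch of how an automated evaluator can flag statements in a generated clinical note that are not supported by the consultation transcript. This is not Scorable's or Root Judge's actual implementation: a real judge is itself an LLM, whereas this toy substitutes a simple word-overlap heuristic so the example runs anywhere, and all names here (`judge_note`, `support_score`, the threshold) are purely illustrative.

```python
# Toy illustration of the LLM-as-judge pattern: score each sentence of a
# generated note against the source transcript and flag unsupported claims.
# A production judge would replace support_score with an LLM call.

def support_score(sentence: str, transcript: str) -> float:
    """Fraction of a sentence's content words that appear in the transcript."""
    stop = {"the", "a", "an", "of", "and", "to", "was", "is", "for"}
    words = [w.strip(".,").lower() for w in sentence.split()]
    content = [w for w in words if w and w not in stop]
    if not content:
        return 1.0
    hits = sum(1 for w in content if w in transcript.lower())
    return hits / len(content)

def judge_note(note: str, transcript: str, threshold: float = 0.6):
    """Return (sentence, supported?, score) verdicts for each note sentence."""
    verdicts = []
    for sentence in filter(None, (s.strip() for s in note.split("."))):
        score = support_score(sentence, transcript)
        verdicts.append((sentence, score >= threshold, round(score, 2)))
    return verdicts

transcript = ("Patient reports mild headache for three days. "
              "No fever. Prescribed ibuprofen 400 mg as needed.")
note = ("Patient has had a mild headache for three days. "
        "Ibuprofen 400 mg prescribed. Patient also reports chest pain.")

for sentence, supported, score in judge_note(note, transcript):
    print(f"{'OK ' if supported else 'FLAG'} ({score}): {sentence}")
```

In this example the fabricated "chest pain" sentence falls below the support threshold and is flagged, which is the kind of hallucination detection, with a transparent per-claim justification, that a judge model provides at scale.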

From healthcare to enterprise AI, Scorable supports teams that want to scale responsibly without sacrificing reliability. By providing the tools to detect hallucinations and validate outputs at every iteration, Scorable helps innovators like Gosta Labs deliver safer, more effective solutions for real-world use.
