
Databricks research reveals that building better AI judges isn't just a technical concern; it's a people problem

It’s not the intelligence of AI models that is holding back enterprise implementations; it’s the inability to define and measure quality.

That’s where AI judges are now playing an increasingly important role. In AI evaluation, a “judge” is an AI system that evaluates the results of another AI system.

Judge Builder is Databricks’ framework for creating judges; it was first deployed as part of the company’s Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and real-world implementations.

Early versions focused on technical implementation, but customer feedback showed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from limited subject matter experts, and deploying evaluation systems at scale.

“The intelligence of the model is typically not the bottleneck, the models are very smart,” Jonathan Frankle, chief AI scientist at Databricks, told VentureBeat in an exclusive briefing. “Instead, it’s really about how do we get the models to do what we want, and how do we know if they did what we wanted?”

The ‘Ouroboros problem’ of AI evaluation

Judge Builder addresses what Pallavi Koppol, a Databricks researcher who led the development, calls the “Ouroboros problem.” An Ouroboros is an ancient symbol that depicts a snake eating its own tail.

Using AI systems to evaluate AI systems creates a circular validation challenge.

“You want a judge to see whether your system is good, whether your AI system is good, but then your judge is also an AI system,” Koppol explains. “And now you say: how do I know if this judge is good?”

The solution is to measure the “distance to the ground truth of human experts” as the primary scoring function. By closing the gap between how an AI judge evaluates results and how domain experts evaluate them, organizations can trust these judges as scalable proxies for human evaluation.
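The scoring idea can be sketched in a few lines. The metric below, mean absolute distance between judge and expert ratings normalized to [0, 1], is an illustrative choice; the article does not specify Databricks' exact formula, and the ratings shown are hypothetical.

```python
# Sketch: scoring a judge by its distance to expert ground truth.
# Ratings are hypothetical 1-5 scores on the same set of outputs.

def judge_alignment(judge_scores, expert_scores, scale=4):
    """Return 1.0 when the judge matches experts exactly, 0.0 at maximum distance.

    `scale` is the largest possible per-item gap (5 - 1 = 4 on a 1-5 scale).
    """
    assert len(judge_scores) == len(expert_scores)
    total = sum(abs(j - e) for j, e in zip(judge_scores, expert_scores))
    return 1.0 - total / (scale * len(judge_scores))

experts = [5, 4, 2, 5, 1]   # consensus ratings from domain experts
judge   = [5, 3, 2, 4, 1]   # the AI judge's ratings of the same outputs
print(judge_alignment(judge, experts))  # 0.9
```

A judge whose alignment score is high enough can then be trusted to rate the thousands of outputs no human team could review.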


This approach is fundamentally different from traditional guardrail systems or single metric evaluations. Rather than asking whether an AI output passes or fails a general quality check, Judge Builder creates very specific evaluation criteria tailored to each organization’s domain expertise and business requirements.

The technical implementation also sets it apart. Judge Builder integrates with Databricks’ MLflow and prompt-optimization tools and can work with any underlying model. Teams can version-control their judges, track performance over time, and deploy multiple judges simultaneously across different quality dimensions.

Lessons learned: Building judges that actually work

Databricks’ work with enterprise clients revealed three critical lessons that apply to anyone building AI judges.

Lesson one: Your experts don’t agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response may be factually correct but use an inappropriate tone; a financial summary can be comprehensive but too technical for the intended audience.

“One of the biggest lessons of this whole process is that all problems become people problems,” Frankle said. “The hardest part is getting an idea out of someone’s brain and turning it into something explicit. And the harder part is that companies are not one brain, but many brains.”

The solution is batch annotation with inter-rater reliability checks. Teams annotate examples in small groups and then measure agreement scores before moving on, so misalignment is caught early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed that they had interpreted the evaluation criteria differently.

Companies using this approach achieve inter-rater reliability scores as high as 0.6, compared to typical scores of 0.3 from third-party annotation services. Higher agreement translates directly into better judge performance, because the training data contains less noise.
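The article doesn't name the exact reliability statistic, but Cohen's kappa is a standard chance-corrected agreement measure whose values fall in this range. A minimal two-rater implementation, with hypothetical pass/fail labels:

```python
# Sketch: chance-corrected inter-rater agreement (Cohen's kappa) for two
# raters labeling the same items. The labels below are hypothetical.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement beyond chance: 1.0 = perfect, 0.0 = no better than chance."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1:          # both raters always used the same single label
        return 1.0
    return (observed - expected) / (1 - expected)

rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(rater_a, rater_b))  # 0.5
```

Running a check like this after each annotation batch is what lets a team stop and discuss criteria before disagreement pollutes the dataset.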


Lesson two: Break down vague criteria into specific judges. Instead of one judge assessing whether a response is ‘relevant, factual and concise’, create three separate judges, each focused on a single quality dimension. This granularity matters because a failing “overall quality” score shows that something is wrong, but not what needs to be fixed.
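The decomposition can be sketched as follows. The three checks below are toy heuristics standing in for prompted LLM judge calls, and every function name and input is hypothetical; the point is the per-dimension report.

```python
# Sketch: three focused judges instead of one compound "overall quality" judge.
# Each check is a toy stand-in for a separately prompted LLM judge.

def judge_concise(response, max_words=50):
    return len(response.split()) <= max_words

def judge_relevant(response, question_keywords):
    return any(k in response.lower() for k in question_keywords)

def judge_factual(response, known_facts):
    return all(f in response for f in known_facts)

def evaluate(response, question_keywords, known_facts):
    """Return a per-dimension report instead of a single opaque score."""
    return {
        "concise": judge_concise(response),
        "relevant": judge_relevant(response, question_keywords),
        "factual": judge_factual(response, known_facts),
    }

report = evaluate(
    "Our refund window is 30 days from delivery.",
    question_keywords=["refund", "return"],
    known_facts=["30 days"],
)
print(report)  # each dimension passes or fails independently
```

When one dimension fails, the team knows exactly which judge, and therefore which quality criterion, to investigate.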

The best results come from combining top-down requirements, such as regulatory restrictions and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness, but discovered through data analysis that correct answers almost always drew on the top two retrieval results. This insight became a new production-friendly judge that could stand in for accuracy checks without the need for ground-truth labels.
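That kind of proxy judge can be sketched in a few lines: instead of requiring ground-truth labels, flag answers that show little overlap with the top two retrieved passages. The word-overlap metric and the 0.5 threshold are illustrative assumptions, not the customer's actual implementation.

```python
# Sketch: a production-friendly proxy judge that needs no ground-truth labels.
# It flags answers with low word overlap against the top-k retrieved passages.

def overlap_with_top_results(answer, retrieved, k=2):
    """Fraction of answer words that appear in the top-k retrieved passages."""
    answer_words = set(answer.lower().split())
    top_words = set()
    for passage in retrieved[:k]:
        top_words |= set(passage.lower().split())
    return len(answer_words & top_words) / max(len(answer_words), 1)

def proxy_judge(answer, retrieved, threshold=0.5):
    return overlap_with_top_results(answer, retrieved) >= threshold

retrieved = ["the warranty covers parts for two years",
             "labor is covered for one year",
             "shipping costs are not covered"]
print(proxy_judge("the warranty covers parts for two years", retrieved))  # True
```

Because the check runs on signals already available at inference time, it can gate production traffic continuously, which labeled evaluation sets cannot.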

Lesson three: You need fewer examples than you think. Teams can build robust judges from as few as 20-30 well-chosen examples. The key is to select edge cases that reveal disagreement rather than obvious examples everyone agrees on.
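One simple way to operationalize "choose the examples that reveal disagreement" is to rank candidates by the variance of expert ratings, a hypothetical selection heuristic rather than Databricks' documented method. The example ids and scores below are invented:

```python
# Sketch: pick annotation examples where experts disagree most, ranked by
# population variance of their 1-5 ratings.
from statistics import pvariance

def pick_edge_cases(candidates, budget=2):
    """Return the example ids with the highest expert disagreement."""
    ranked = sorted(candidates, key=lambda ex: pvariance(candidates[ex]),
                    reverse=True)
    return ranked[:budget]

candidates = {
    "obvious_pass": [5, 5, 5],   # everyone agrees: low training value
    "tone_dispute": [1, 5, 3],   # strong disagreement: worth discussing
    "mild_dispute": [4, 3, 4],
}
print(pick_edge_cases(candidates))  # ['tone_dispute', 'mild_dispute']
```

Spending the 20-30 example budget on high-variance cases like these is what makes a three-hour calibration session sufficient.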

“We can do this process in as little as three hours with some teams, so it doesn’t take that long to get a good judge,” Koppol said.

Production results: from pilots to seven-figure implementations

Frankle shared three metrics Databricks uses to measure Judge Builder’s success: whether customers want to use it again, whether they increase AI spend, and whether they progress in their AI journey.

On the first metric, one client created more than a dozen judges after their first workshop. “This client gained more than a dozen judges after we rigorously guided them with this framework for the first time,” Frankle said. “They really started working with the judges and are now measuring everything.”

For the second metric, the business impact is clear. “There are several customers who have taken this workshop and are spending money on GenAI at Databricks in a way that they hadn’t before,” Frankle said.


The third metric reveals Judge Builder’s strategic value. Customers who previously hesitated to use advanced techniques such as reinforcement learning can now use them with confidence because they can measure whether improvements have actually occurred.

“There are clients who have started doing very sophisticated things after having these judges, where they were reluctant before,” Frankle said. “They’ve gone from a little bit of prompt engineering to reinforcement learning with us. Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning, when you don’t know if it really made a difference?”

What companies should do now

The teams that have successfully moved AI from pilot to production are treating judges not as one-off artifacts, but as evolving assets that grow with their systems.

Databricks recommends three practical steps. First, focus on high-impact judges by identifying one critical regulatory requirement plus one observed failure mode. These become your initial judge portfolio.

Second, create lightweight workflows with subject matter experts. A few hours of reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batch annotations and inter-rater reliability checks to remove the noise from your data.

Third, schedule regular judge reviews using production data. As your system evolves, new failure modes will emerge, and your judge portfolio should evolve with them.

“A judge is a way to evaluate a model. It’s also a way to create guardrails. It’s also a way to have a metric against which you can do prompt optimization, and it’s also a way to have a metric against which you can do reinforcement learning,” Frankle said. “Once you have a judge that you know represents your human taste, in an empirical form that you can query as much as you want, you can use it in 10,000 different ways to measure or improve your agents.”

