The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

There is no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model across various useful tasks – from coding to instruction following to agentic web surfing and tool use. But many of these benchmarks share one major flaw: they measure the AI’s ability to complete specific problems and requests, not how factual the model’s output is – how reliably it generates objectively correct information linked to real-world data – especially when that information comes from images or graphics.

For industries where accuracy is paramount – legal, financial and medical – the lack of a standardized way to measure factuality has been a critical blind spot.

That changes today: Google’s FACTS team and its data science unit Kaggle have released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.

The associated research paper reveals a more nuanced definition of the problem, splitting ‘factuality’ into two different operational scenarios: ‘contextual factuality’ (basing answers on provided data) and ‘world knowledge factuality’ (retrieving information from memory or the web).

While the headline news is Gemini 3 Pro taking the top spot, the deeper story for builders is the industry-wide ‘fact wall’.

According to the first results, no model – including Gemini 3 Pro, GPT-5 or Claude 4.5 Opus – managed to reach an overall accuracy score of 70% on the problem set. For tech leaders, this is a signal: the era of ‘trust but verify’ is far from over.

Deconstructing the benchmark

The FACTS suite goes beyond simple question-and-answer sessions. It is composed of four different tests, each simulating a different real-world failure mode that developers encounter in production:

  1. Parametric benchmark (internal knowledge): Can the model accurately answer trivia-type questions using just the training data?

  2. Search benchmark (tool usage): Can the model effectively use a web search tool to retrieve and synthesize live information?

  3. Multimodal benchmark (vision): Can the model accurately interpret graphs, charts and images without hallucinating?

  4. Grounding Benchmark v2 (context): Can the model strictly adhere to the supplied source text?

Google has released 3,513 samples to the public, while Kaggle has a private set to prevent developers from training on the test data – a common problem known as ‘contamination’.

The rankings: a game of inches

The first run of the benchmark puts Gemini 3 Pro in the lead with a comprehensive FACTS score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds lie for engineering teams.

| Model | FACTS score (average) | Search (RAG capability) | Multimodal (vision) |
| --- | --- | --- | --- |
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

Data taken from the FACTS team release notes.

For builders: the divide between ‘search’ and ‘parametric’

For developers building Retrieval-Augmented Generation (RAG) systems, the search benchmark is the most critical metric.

The data shows a huge discrepancy between a model’s ability to ‘know’ things (Parametric) and its ability to ‘find’ things (Search). For example, Gemini 3 Pro scores a high 83.8% on search tasks, but only 76.4% on parametric tasks.

This validates the current standard for enterprise architecture: don’t rely on a model’s internal memory for critical facts.

If you are building an internal knowledge bot, the FACTS results suggest that linking your model to a search tool or vector database is not optional; it is the only way to bring accuracy to acceptable production levels.
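The pattern the FACTS results argue for can be sketched in a few lines. This is a minimal, illustrative example of grounding a model on retrieved context instead of its parametric memory: the keyword-overlap retriever, the tiny in-memory corpus and the prompt wording are all our own assumptions standing in for a real vector database and LLM call, not part of the FACTS release.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval, standing in for a vector database."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str, corpus: list[str]) -> str:
    """Force the model to answer from supplied passages, not from memory."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus))
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

corpus = [
    "Refunds are processed within 14 days of a return request.",
    "Support is available Monday through Friday, 9am to 5pm CET.",
]
print(build_grounded_prompt("When are refunds processed?", corpus))
```

The point is the contract, not the retriever: the model only ever sees passages you control, which is what the Grounding and Search benchmarks measure.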

The multimodal warning

The most alarming data point for product managers is performance on multimodal tasks. The scores here are universally low. Even the category leader, Gemini 2.5 Pro, only achieved an accuracy of 46.9%.

See also  How Google’s AlphaChip is Redefining Computer Chip Design

The benchmark tasks include reading graphs, interpreting diagrams and identifying objects in nature. With less than 50% accuracy across the board, this suggests that multimodal AI is not yet ready for unsupervised data extraction.

In short: if your product roadmap involves an AI automatically extracting data from invoices or interpreting financial charts without human oversight, you are likely to introduce significant error rates into your pipeline.

Why this is important for your stack

The FACTS Benchmark will likely become a standard reference point for tenders. When evaluating business use models, technical leaders should look beyond the composite score and dig into the specific sub-benchmark that fits their use case:

  • Building a customer support bot? Look at the Grounding score to ensure the bot adheres to your policy documents. (Gemini 2.5 Pro actually scored better here than Gemini 3 Pro: 74.2 versus 69.0.)

  • Building a research assistant? Prioritize the Search score.

  • Building an image analysis tool? Proceed with extreme caution.
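That per-use-case triage is easy to encode. The sketch below uses the published Search and Multimodal scores from the table above; the mapping from use case to metric (and the helper itself) is our own illustrative assumption, not anything shipped with the benchmark.

```python
# Published FACTS sub-benchmark scores (Search, Multimodal) from the table above.
SCORES = {
    "Gemini 3 Pro":    {"search": 83.8, "multimodal": 46.1},
    "Gemini 2.5 Pro":  {"search": 63.9, "multimodal": 46.9},
    "GPT-5":           {"search": 77.7, "multimodal": 44.1},
    "Grok 4":          {"search": 75.3, "multimodal": 25.7},
    "Claude 4.5 Opus": {"search": 73.2, "multimodal": 39.2},
}

def best_model_for(metric: str) -> str:
    """Pick the model with the highest score on one sub-benchmark."""
    return max(SCORES, key=lambda model: SCORES[model][metric])

# A research assistant should weight Search; an image tool, Multimodal.
print(best_model_for("search"))      # Gemini 3 Pro
print(best_model_for("multimodal"))  # Gemini 2.5 Pro
```

Note how the two questions give different answers, which is exactly why the composite score alone is a poor procurement signal.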

As the FACTS team noted in their press release, “All models evaluated achieved an overall accuracy of less than 70%, leaving significant room for future progress.” For now, the message to the industry is clear: the models are getting smarter, but they are not yet infallible. Design your systems with the assumption that the raw model might be wrong about a third of the time.
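Designing for that failure rate means wrapping model output in a verification layer rather than trusting it raw. A hedged sketch of such a ‘trust but verify’ wrapper is below: the verbatim-substring support check is a deliberately crude stand-in (production systems would use an NLI or citation model), and the data classes and sources are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class VerifiedAnswer:
    text: str
    grounded: bool       # True if the claim is supported by a source
    needs_review: bool   # route to a human when not grounded

def verify(answer: str, sources: list[str]) -> VerifiedAnswer:
    """Crude support check: accept only answers that appear verbatim
    in a source document; everything else goes to human review."""
    grounded = any(answer.lower() in src.lower() for src in sources)
    return VerifiedAnswer(answer, grounded, needs_review=not grounded)

sources = ["The warranty period is 24 months from the date of purchase."]
ok = verify("the warranty period is 24 months", sources)
bad = verify("the warranty period is 36 months", sources)
```

Here `ok` passes through while `bad` is flagged for review – the system stays useful even when the model is wrong a third of the time.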
