Gemini 3 Pro scores 69% trust in blinded testing, up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks


Just a few weeks ago, Google debuted its Gemini 3 model, claiming leading scores across multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that: vendor-provided.
A new vendor-neutral evaluation from Prolific, however, also places Gemini 3 at the top of the rankings. The ranking is not based on academic benchmarks; rather, it measures real-world qualities that real users and organizations care about.
Prolific was founded by researchers from the University of Oxford to provide high-quality, reliable human data for rigorous research and ethical AI development. The company's HUMAINE benchmark applies this approach, using representative human samples and blind tests to rigorously compare AI models across a variety of user scenarios, measuring not only technical performance but also user confidence, adaptability and communication style.
The latest HUMAINE round put models in front of 26,000 users in a blind test. In the evaluation, Gemini 3 Pro's trust score rose from 16% to 69%, the highest Prolific has ever measured: Gemini 3 now ranks first in trust, ethics and security across demographic subgroups 69% of the time, compared to its predecessor Gemini 2.5 Pro, which ranked first only 16% of the time.
Overall, Gemini 3 ranked first in three of the four evaluation categories: performance and reasoning, interaction and adaptability, and trust and security. It lost only on communication style, where DeepSeek V3 led on user preference by 43%. The HUMAINE test also showed that Gemini 3 performed consistently well across 22 different user demographic groups, spanning variations in age, gender, ethnicity and political orientation, and that users are now five times more likely to choose the model in head-to-head blind comparisons.
But the ranking matters less than why it won.
“It’s the consistency across a very wide range of different use cases, and a personality and a style that appeals to a wide range of different user types,” Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. “While in some specific cases other models may be preferred by small subgroups or for particular conversation types, it is the breadth of knowledge and flexibility of the model across a range of different use cases and audience types that won it this particular benchmark.”
How blinded testing reveals what academic benchmarks miss
HUMAINE’s methodology exposes gaps in the way the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don’t know which vendor produced each response. And they discuss topics that matter to them, not predetermined test questions.
The sample itself matters, too. HUMAINE uses representative samples of American and British populations, controlling for age, gender, ethnicity and political orientation. This reveals something that static benchmarks can’t capture: model performance varies across audiences.
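The per-subgroup rankings described above can be sketched as a simple win-rate computation over blinded pairwise votes. This is a hypothetical illustration, not Prolific's actual pipeline; the model names and subgroup labels are invented for the example.

```python
from collections import defaultdict

def subgroup_win_rates(votes):
    """Compute per-subgroup win rates from blinded pairwise votes.

    Each vote is (subgroup, winner, loser), recorded without the rater
    ever knowing which vendor produced which response.
    """
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for subgroup, winner, loser in votes:
        wins[subgroup][winner] += 1
        totals[subgroup][winner] += 1
        totals[subgroup][loser] += 1
    # Win rate = wins / appearances, computed separately per subgroup.
    return {
        group: {m: wins[group][m] / totals[group][m] for m in totals[group]}
        for group in totals
    }

# Toy data: two age brackets, two anonymized models.
votes = [
    ("18-24", "model_a", "model_b"),
    ("18-24", "model_b", "model_a"),
    ("55+",   "model_a", "model_b"),
    ("55+",   "model_a", "model_b"),
]
rates = subgroup_win_rates(votes)
# model_a splits the younger bracket but sweeps the older one,
# the kind of audience-dependent ranking a single aggregate score hides.
```

A single leaderboard number would average these subgroups away; stratifying the win rate is what surfaces the audience-dependent ranking Bradley describes.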
“If you take an AI leaderboard, the majority of them still show a fairly static list,” Bradley said. “But for us, if you look at the audience, we get a slightly different ranking depending on whether you look at a left-wing, right-wing, American or British sample. And I think age was actually the most differentiating condition in our experiment.”
This matters for companies deploying AI across diverse employee populations. A model that performs well for one group may underperform for another.
The methodology also addresses a fundamental question in AI evaluation: why use human judges at all if AI could evaluate itself? Bradley noted that his company is using AI judges in certain use cases, though he emphasized that human evaluation is still the critical factor.
“We see the greatest benefit from the smart orchestration of both LLM judges and human data; both have strengths and weaknesses, which when cleverly combined perform better together,” said Bradley. “But we still think human data is the alpha. We’re still extremely optimistic that human data and human intelligence should sit on top.”
What trust means in AI evaluation
The trust, ethics and security category measures users’ confidence in a model’s reliability, factual accuracy and responsible behavior. In HUMAINE’s methodology, trust is not a vendor claim or a technical benchmark; it’s what users report after blind conversations with competing models.
The 69% figure represents how often Gemini 3 ranked first in trust across demographic subgroups. That consistency matters more than any single aggregate score, because organizations serve diverse populations.
“There was no awareness that they were using Gemini in this scenario,” Bradley said. “It was just based on the blinded multi-turn response.”
This separates earned trust from perceived trust. Users rated the models’ output without knowing which vendor produced it, eliminating Google’s brand advantage. For customer-facing deployments, where the AI vendor remains invisible to end users, that distinction is important.
What companies should do now
The crucial step for companies now weighing different models is to adopt an evaluation framework that works.
“It is becoming increasingly challenging to evaluate models based solely on vibes,” Bradley said. “I think we need increasingly rigorous, scientific approaches to really understand how these models perform.”
The HUMAINE data suggests a framework: test for consistency across use cases and user demographics, not just peak performance on specific tasks. Run blind tests to separate model quality from brand perception. Use representative samples that match your actual user population. And plan for continuous evaluation as models change.
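The blind-testing step above can be sketched in a few lines. This is a hypothetical harness, not Prolific's real system: the point is only that vendor identity stays hidden behind anonymous slots until after the vote is recorded.

```python
import random

def blind_trial(models, rater_vote, rng=None):
    """Run one blinded pairwise trial.

    Two models are shuffled into anonymous slots "A" and "B"; the rater
    only ever sees those labels, so brand perception cannot drive the
    vote. `rater_vote` is a callback returning "A" or "B".
    """
    rng = rng or random.Random()
    a, b = rng.sample(models, 2)        # random order hides identity
    choice = rater_vote("A", "B")       # rater judges anonymized slots
    winner = a if choice == "A" else b  # de-anonymize only after voting
    loser = b if choice == "A" else a
    return winner, loser

# Example: a rater who always picks slot "A" still produces a random
# winner across trials, because the slot assignment is shuffled.
winner, loser = blind_trial(
    ["model_x", "model_y"],
    rater_vote=lambda a, b: a,
    rng=random.Random(0),
)
```

Keeping the slot-to-model mapping server-side until after the vote is what turns a preference test into a measure of earned rather than perceived trust.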
For companies looking to deploy AI at scale, this means moving beyond “which model is best” to “which model is best for our specific use case, user demographics, and required features.”
The rigor of representative sampling and blind testing provides the data to make that determination, something that engineering benchmarks and vibes-based evaluation cannot provide.




