
Did xAI lie about Grok 3’s benchmarks?

Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.

This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, insisted that the company was in the right.

The truth is somewhere in between.

In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.

xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."

What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark, taking the answers it generates most often as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores considerably, and omitting it from a graph can make it appear as though one model surpasses another when in reality it doesn't.
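To make the distinction concrete, here's a minimal sketch of the two scoring schemes in Python; `toy_model`, `pass_at_1`, and `cons_at_k` are illustrative stand-ins, not xAI's or OpenAI's actual evaluation code:

```python
import random
from collections import Counter
from typing import Callable

def pass_at_1(model: Callable[[str], str], problems: dict[str, str]) -> float:
    """Score at "@1": grade a single sampled answer per problem."""
    correct = sum(model(question) == answer for question, answer in problems.items())
    return correct / len(problems)

def cons_at_k(model: Callable[[str], str], problems: dict[str, str], k: int = 64) -> float:
    """cons@k: sample k answers per problem and grade the majority vote."""
    correct = 0
    for question, answer in problems.items():
        samples = [model(question) for _ in range(k)]
        majority_answer, _ = Counter(samples).most_common(1)[0]
        correct += majority_answer == answer
    return correct / len(problems)

# Toy model: answers correctly 40% of the time, otherwise picks a wrong answer.
def toy_model(question: str) -> str:
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "7"])

problems = {"What is 6 * 7?": "42"}
print(pass_at_1(toy_model, problems))  # often 0.0: a single sample misses 60% of the time
print(cons_at_k(toy_model, problems))  # almost always 1.0: "42" wins the 64-sample vote
```

With the toy model, a single attempt is wrong more often than not, yet the 64-sample majority vote almost always lands on the right answer; that gap is exactly what a cons@64 bar adds to a chart.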

Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1", meaning the first score the models got on the benchmark, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.


