OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.
When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of the questions on FrontierMath, a challenging set of math problems. That score blew away the competition; the next-best model managed to answer only about 2% of FrontierMath problems correctly.
“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly launched last week.
Epoch AI, the research institute behind FrontierMath, published the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.
OpenAI has released o3, their long-awaited reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkey1b

– Epoch AI (@EpochAIResearch) April 18, 2025
That’s not to say OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in FrontierMath-2024-11-26 versus the 290 problems in FrontierMath-2025-02-28-Private),” wrote Epoch.
According to a post on X from the Arc Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” which corroborates Epoch’s report.
“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote Arc Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.
Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today’s release is a materially different system, we are relabeling our previously reported results as “preview”:

o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task

o1 pro prices …

– Mike Knoop (@mikeknoop) April 16, 2025
OpenAI’s own Wenda Zhou, a member of the company’s technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed compared to the version of o3 demoed in December. As a result, its benchmark results may show “differences,” he added.
“[W]e’ve done [optimizations] to make the [model] more cost-efficient [and] more useful in general,” Zhou said. “We still hope – we still think – this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”
Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o4-mini model already outperforms o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.
Still, it’s another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.
Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.
In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.
More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.
Updated at 4:21 p.m. Pacific: Added comments from Wenda Zhou, a member of OpenAI’s technical staff, made during a livestream last week.