
Meta’s benchmarks for its new AI models are a bit misleading

One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and pick which they prefer. But it appears that the version of Maverick Meta deployed on LM Arena differs from the version that is broadly available to developers.

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an "experimental chat version." A chart on the official Llama website, meanwhile, discloses that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."

As we have written before, LM Arena has never been the most reliable measure of an AI model's performance, for various reasons. But AI companies generally have not tailored their models to score better on LM Arena, or at least have not admitted to doing so.

The problem with tailoring a model to a benchmark, withholding it, and then releasing a "vanilla" variant of the same model is that it makes it challenging for developers to predict exactly how well the model will perform in particular contexts. It is also misleading. Ideally, benchmarks, inadequate as they are, offer a snapshot of a single model's strengths and weaknesses across a range of tasks.

Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version appears to use lots of emojis and gives incredibly long-winded answers.

We have contacted Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.
