
Meta’s benchmarks for its new AI models are a bit misleading

One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and pick which they prefer. But it appears that the version of Maverick Meta deployed on LM Arena differs from the version that is broadly available to developers.

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an "experimental chat version." A chart on the official Llama website, meanwhile, discloses that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."

As we have written before, LM Arena has never been the most reliable measure of an AI model's performance, for various reasons. But AI companies generally have not tailored their models to score better on LM Arena, or at least have not admitted to doing so.

The problem with tailoring a model to a benchmark, withholding it, and then releasing a "vanilla" variant of the same model is that it makes it challenging for developers to predict exactly how well the model will perform in particular contexts. It is also misleading. Ideally, benchmarks, inadequate as they are, offer a snapshot of a single model's strengths and weaknesses across a range of tasks.

Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version appears to use lots of emojis and gives incredibly long-winded answers.

We have contacted Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.
