
Meta’s vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident prompted the maintainers of LM Arena to apologize, change their policies, and score the unmodified, vanilla Maverick.

As it turns out, the vanilla Maverick isn't very competitive.

The unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," ranked below models including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro as of Friday. Many of these models are months old.

Why the poor performance? Meta's experimental Maverick, Llama-4-Maverick-03-26-Experimental, was "optimized for conversationality," the company explained in a chart published last Saturday. Those optimizations evidently played well on LM Arena, where human raters compare the outputs of models and choose which they prefer.

As we've written before, for various reasons, LM Arena has never been the most reliable measure of an AI model's performance. Nevertheless, tailoring a model to a benchmark, besides being misleading, makes it challenging for developers to predict exactly how well the model will perform in different contexts.

In a statement, a Meta spokesperson told WAN that Meta experiments with "all types of custom variants."

"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LMArena," the spokesperson said. "We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We're excited to see what they build and look forward to their ongoing feedback."
