
Meta’s vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident prompted the maintainers of LM Arena to apologize, change their policies, and score the unmodified, vanilla Maverick.

As it turns out, the vanilla Maverick isn't very competitive.

The unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," ranked below models including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro as of Friday. Many of these models are months old.

Why the poor performance? Meta's experimental Maverick, Llama-4-Maverick-03-26-Experimental, was "optimized for conversationality," the company explained in a chart published last Saturday. Those optimizations evidently played well on LM Arena, where human raters compare the outputs of models and choose which they prefer.

As we've written before, for various reasons, LM Arena has never been the most reliable measure of an AI model's performance. Nevertheless, tailoring a model to a benchmark, besides being misleading, makes it challenging for developers to predict exactly how well the model will perform in different contexts.

In a statement, a Meta spokesperson told WAN that Meta experiments with "all types of custom variants."

"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LMArena," the spokesperson said. "We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We're excited to see what they build and look forward to their ongoing feedback."
