
Debates over AI benchmarking have reached Pokémon

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google's newest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town on a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post did not mention is that Gemini had an advantage.

As users on Reddit noted, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game, such as cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is at best a semi-serious AI benchmark; few would argue that it is a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores considerably worse on the same evaluation.


Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it seems unlikely that it will get any easier to compare models as they are released.


