People are using Super Mario to benchmark AI now

March 4, 2025


Thought Pokémon was a tough benchmark for AI? A group of researchers claims that Super Mario Bros. is even tougher.

Hao AI Lab, a research organization at the University of California San Diego, threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.

It was not quite the same version of Super Mario Bros. as the original 1985 release, to be clear. The game ran in an emulator and was integrated with a framework, GamingAgent, to give the AIs control over Mario.

Super Mario Bros. AI benchmark. **Image Credits:** Hao Lab

GamingAgent, which Hao developed in-house, fed the AI basic instructions, such as “If an obstacle or enemy is close by, move/jump left to avoid,” along with in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.
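To make that loop concrete, here is a minimal sketch of how a screenshot-to-action cycle of this kind could be wired up. It is not GamingAgent's actual code: every name in it (`Observation`, `build_prompt`, `stub_model`, `run_episode`, the `press` helper and the button labels) is hypothetical, the "screenshot" is reduced to a one-line text description, and a hard-coded stub stands in for the real model call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Instruction text quoted in the article; how the real framework
# phrases or sends it is not known here.
BASE_INSTRUCTIONS = "If an obstacle or enemy is close by, move/jump left to avoid"


@dataclass
class Observation:
    """Stand-in for one in-game screenshot (hypothetical structure)."""
    frame_index: int
    description: str


def build_prompt(obs: Observation) -> str:
    """Combine the basic instructions with the current frame."""
    return f"{BASE_INSTRUCTIONS}\nFrame {obs.frame_index}: {obs.description}"


def stub_model(prompt: str) -> str:
    """Fake model: returns one line of Python as its chosen input."""
    scene = prompt.splitlines()[-1]  # only look at the frame line
    return "press('A')" if "enemy" in scene else "press('RIGHT')"


def run_episode(frames: List[Observation], model: Callable[[str], str]) -> List[str]:
    """Feed each frame to the model and execute the code it returns."""
    pressed: List[str] = []

    def press(button: str) -> None:
        pressed.append(button)

    for obs in frames:
        action_code = model(build_prompt(obs))
        # Expose only `press` to the generated code. This is for
        # illustration and is NOT a real security sandbox.
        exec(action_code, {"__builtins__": {}, "press": press})
    return pressed


frames = [Observation(0, "clear path"), Observation(1, "enemy ahead")]
print(run_episode(frames, stub_model))  # ['RIGHT', 'A']
```

Swapping `stub_model` for a function that calls a real model would leave the rest of the loop unchanged, which is roughly the appeal of having the model answer with code.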

Still, Hao says that the game forced each model to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that reasoning models such as OpenAI’s o1, which “think” through problems step by step to arrive at solutions, performed worse than “non-reasoning” models, even though they are generally stronger on most benchmarks.

One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while, usually seconds, to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plunge to your death.
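A quick back-of-the-envelope calculation shows why seconds matter. The sketch below assumes the emulator runs at roughly 60 frames per second, which is about the rate of the original NTSC console; the researchers' actual setup may differ, and the function name `frames_elapsed` is made up for illustration.

```python
def frames_elapsed(decision_seconds: float, fps: float = 60.0) -> int:
    """Number of game frames that pass while a model decides what to do.

    The 60 fps default is an assumption, not a figure from the researchers.
    """
    return round(decision_seconds * fps)


print(frames_elapsed(1.0))  # 60
print(frames_elapsed(2.0))  # 120
```

Under that assumption, a model that takes one second to respond is acting on a screenshot that is already 60 frames old by the time its input reaches the game.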

Games have been used for decades to benchmark AI. But some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological progress. Unlike the real world, games tend to be abstract and relatively simple, and they offer a theoretically infinite amount of data to train AI.

The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member at OpenAI, called an “evaluation crisis.”

“I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X. “TLDR my reaction is that I don’t really know how good these models are right now.”

At least we can watch AI play Mario.
