A high schooler built a website that lets you challenge AI models to a Minecraft build-off

As conventional AI benchmarking techniques prove insufficient, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that means Minecraft, the sandbox building game owned by Microsoft.
The website Minecraft Benchmark (or MC-Bench) was built to pit AI models against each other in head-to-head challenges, responding to prompts with Minecraft creations. Users vote on which model did better, and only after voting can they see which AI made each build.
For Adi Singh, the 12th grader who started MC-Bench, the value of Minecraft is not so much the game itself, but the familiarity that comes with it being the best-selling video game of all time. Even people who have never played it can judge which blocky rendition of a pineapple is better realized.
“Minecraft lets people see the progress [of AI development] much more easily,” Singh said. “People are used to Minecraft, used to the look and feel of it.”
MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project’s use of their products to run the benchmark prompts, according to the MC-Bench website, but the companies are not otherwise affiliated.
“We’re currently just doing simple builds to reflect on how far we’ve come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,” Singh said. “Games might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.”
Other games such as Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is notoriously difficult.
Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they are trained, models are naturally gifted at certain narrow types of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.
Simply put, it’s hard to grasp what it means that OpenAI’s GPT-4 can score in the 88th percentile on the LSAT but can’t tell how many Rs are in the word “strawberry.” Anthropic’s Claude 3.7 Sonnet reached 62.3% accuracy on a standardized software engineering benchmark, yet it is worse at playing Pokémon than most five-year-olds.

MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, such as “Frosty the Snowman” or “a charming tropical beach hut on an unspoilt sandy shore.”
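The article doesn’t show what that generated code actually looks like, but as a rough sketch of the kind of program a model might emit for the “Frosty the Snowman” prompt, here is one written against the open-source mcpi Python library; that library choice is an assumption for illustration, since MC-Bench’s own build API isn’t described here:

```python
# Illustrative sketch only: the article doesn't describe MC-Bench's
# actual harness or build API. This assumes the open-source mcpi
# library, which scripts a Minecraft world over a local socket
# (Pi Edition, or a Java server running the RaspberryJuice plugin).
from mcpi.minecraft import Minecraft
from mcpi import block

mc = Minecraft.create()  # connects to localhost:4711 by default


def sphere(cx, cy, cz, r, block_id):
    """Place a rough voxel sphere of the given block around (cx, cy, cz)."""
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dz in range(-r, r + 1):
                if dx * dx + dy * dy + dz * dz <= r * r:
                    mc.setBlock(cx + dx, cy + dy, cz + dz, block_id)


# "Frosty the Snowman": three stacked snow spheres near the player.
# A fuller model response would go on to add eyes, buttons, and a hat.
pos = mc.player.getTilePos()
x, z = pos.x + 8, pos.z
sphere(x, pos.y + 3, z, 3, block.SNOW_BLOCK.id)   # body
sphere(x, pos.y + 7, z, 2, block.SNOW_BLOCK.id)   # torso
sphere(x, pos.y + 10, z, 1, block.SNOW_BLOCK.id)  # head
```

A harness would then render the resulting build and screenshot it for the blind vote; the code itself stays behind the scenes.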
But it’s easier for most MC-Bench users to judge whether a snowman looks better than to dig into the code, which gives the project a broader appeal, and therefore the potential to collect more data on which models consistently score better.
Whether those scores say much about AI’s usefulness is, of course, up for debate. Singh maintains that they are a strong signal, though.
“The current leaderboard reflects quite closely my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Perhaps [MC-Bench] could be useful to companies to know whether they’re heading in the right direction.”