A new, challenging AGI test stumps most AI models

The ARC Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.
So far, the new test, called ARC-AGI-2, has stumped most models.
“Reasoning” AI models such as OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the ARC Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.
The ARC-AGI tests consist of puzzle-like problems in which an AI must identify visual patterns from a collection of differently colored squares and generate the correct “answer” grid. The problems are designed to force an AI to adapt to novel problems it has never seen before.
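For a concrete sense of what these puzzles look like to a model, here is a minimal sketch in Python following the publicly released ARC-AGI-1 JSON format (grids of integers 0–9 standing in for colors); the toy “swap the columns” rule is invented for illustration and is not an actual ARC-AGI-2 task.

```python
# Minimal sketch of an ARC-style task, following the publicly released
# ARC-AGI-1 JSON format (ARC-AGI-2 tasks may differ in detail).
# Each grid is a 2D list of ints 0-9, where each int stands for a color.

import json

task = {
    "train": [  # demonstration pairs the solver can study
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [  # held-out input; the solver must produce the output grid
        {"input": [[3, 0], [0, 3]]},
    ],
}

def solve(grid):
    """Toy rule for this illustrative task: swap the two columns.

    Real ARC tasks require inferring a novel rule from only a few
    demonstration pairs, which is what trips up current models.
    """
    return [row[::-1] for row in grid]

# Check the toy rule against the demonstration pairs.
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(json.dumps(solve(task["test"][0]["input"])))  # [[0, 3], [3, 0]]
```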
The ARC Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people answered 60% of the test questions correctly, far better than any of the models’ scores.

In a post on X, Chollet claimed that ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The ARC Prize Foundation’s tests are aimed at evaluating whether an AI system can acquire new skills outside the data it was trained on.
Chollet said that the new test, unlike ARC-AGI-1, prevents AI models from relying on “brute force” (extensive computing power) to find solutions. Chollet previously acknowledged that this was a major flaw of ARC-AGI-1.
To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.
“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” ARC Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”
ARC-AGI-1 went unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which surpassed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance on ARC-AGI-1 came with a hefty price tag.
The version of OpenAI’s o3 model, o3 (low), that first reached new heights on ARC-AGI-1, scoring 75.7% on the test, managed a measly 4% on ARC-AGI-2 while spending $200 in computing power per task.

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.
In addition to the new benchmark, the ARC Prize Foundation announced a new ARC Prize 2025 contest, challenging developers to achieve 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.
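To put that efficiency bar in perspective, here is a back-of-the-envelope sketch in Python using only the per-task figures cited above; the 100-task run size is a hypothetical round number for illustration, not the benchmark’s actual task count.

```python
# Back-of-the-envelope cost comparison using the per-task figures cited
# in this article. NUM_TASKS = 100 is a hypothetical round number for
# illustration, not the actual size of the ARC-AGI-2 evaluation set.

NUM_TASKS = 100

O3_LOW_COST_PER_TASK = 200.00   # o3 (low) compute cost on ARC-AGI-1 ($)
CONTEST_BUDGET_PER_TASK = 0.42  # ARC Prize 2025 per-task spending cap ($)

o3_total = NUM_TASKS * O3_LOW_COST_PER_TASK
contest_total = NUM_TASKS * CONTEST_BUDGET_PER_TASK

print(f"o3 (low) at $200/task over {NUM_TASKS} tasks: ${o3_total:,.2f}")
print(f"Contest cap at $0.42/task over {NUM_TASKS} tasks: ${contest_total:,.2f}")
print(f"Cost ratio: {O3_LOW_COST_PER_TASK / CONTEST_BUDGET_PER_TASK:,.0f}x")
```

At those rates, the contest’s per-task budget is roughly 476 times smaller than what o3 (low) spent per task on ARC-AGI-1.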