
These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

Every Sunday, NPR host Will Shortz, the crossword puzzle guru of The New York Times, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without much prior knowledge, the brainteasers are usually a challenge even for skilled contestants.

That is why some experts think they are a promising way to test the limits of AI's problem-solving abilities.

In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, such as that reasoning models, including OpenAI's o1, sometimes "give up" and give answers that they know are not correct.

"We wanted to develop a benchmark with problems that humans can understand with only general knowledge," Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors of the study, told WAN.

The AI industry is currently in a bit of a benchmarking dilemma. Most tests commonly used to evaluate AI models probe for skills, such as competence on PhD-level math and science questions, that are not relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.

The benefit of a public radio quiz game such as the Sunday Puzzle is that it does not test for esoteric knowledge, and the challenges are phrased in such a way that models cannot draw on "rote memory" to solve them, Guha explained.


“I think what makes these problems difficult is that it is really difficult to make meaningful progress until you solve it – that’s when everything clicks together in one go,” Guha said. “That requires a combination of insight and an elimination process.”

No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English only. And because the quizzes are publicly available, it is possible that models trained on them can "cheat" in a sense, although Guha says he has not seen any evidence of this.

"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We plan to keep the benchmark fresh and track how model performance changes over time."

On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.

At least one model, DeepSeek's R1, gives solutions that it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim "I give up," followed by an incorrect answer seemingly chosen at random, behavior this writer can certainly relate to.

The models make other bizarre choices, such as giving a wrong answer only to retract it immediately, attempting to tease out a better one, and failing again. They also get stuck indefinitely and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.


"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model emulates what a person might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."

R1 gets "frustrated" on a question in the Sunday Puzzle challenge set. Image credits: Guha et al.

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models can be improved.

The scores of the models the team tested on their benchmark. Image credits: Guha et al.

"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access enables a wider set of researchers to understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are and aren't capable of."

