OpenAI’s new reasoning AI models hallucinate more

OpenAI's recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still make things up, or hallucinate; in fact, they hallucinate more than several of OpenAI's older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, affecting even today's best-performing systems. Historically, each new model has improved somewhat in the hallucination department, hallucinating less than its predecessor. But that does not appear to be the case for o3 and o4-mini.

According to OpenAI's internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company's earlier reasoning models, o1, o1-mini, and o3-mini, as well as OpenAI's traditional, "non-reasoning" models, such as GPT-4o.

Perhaps more concerning, the ChatGPT maker does not really know why this is happening.

In its technical report for o3 and o4-mini, OpenAI writes that "more research is needed" to understand why hallucinations get worse as reasoning models scale up. o3 and o4-mini perform better in some areas, including tasks related to coding and mathematics. But because they "make more claims overall," they often end up making both more accurate claims and more inaccurate ones.

OpenAI found that o3 hallucinated in response to 33% of the questions on PersonQA, the company's internal benchmark for measuring the accuracy of a model's knowledge about people. That is roughly double the hallucination rate of OpenAI's earlier reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. o4-mini did even worse on PersonQA, hallucinating 48% of the time.


Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 tends to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro "outside of ChatGPT" and then copied the numbers into its answer. While o3 has access to some tools, it cannot do that.

"Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines," said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email to WAN.

Sarah Schwettmann, co-founder of Transluce, added that o3's hallucination rate could make it less useful than it otherwise would be.

Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told WAN that his team is already testing o3 in their coding workflows and has found it to be a step above the competition. However, Katanforoosh says that o3 tends to hallucinate broken website links: the model supplies a link that, when clicked, does not work.

Hallucinations may help models arrive at interesting ideas and be creative in their "thinking," but they also make some models a tough sell for businesses in markets where accuracy is paramount. A law firm, for example, would likely not be pleased with a model that inserts numerous factual errors into client contracts.

One promising approach to boosting the accuracy of models is to give them web search capabilities. OpenAI's GPT-4o with web search achieves 90% accuracy on SimpleQA, one of OpenAI's accuracy benchmarks. Potentially, search could also improve the hallucination rates of reasoning models, at least in cases where users are willing to expose their prompts to a third-party search provider.
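As a rough sketch of what "giving a model web search" can look like in practice, the example below assumes the OpenAI Python SDK's Responses API and its web_search_preview tool; the model name and prompt are illustrative placeholders, not details from the article.

from openai import OpenAI

# Minimal sketch: grounding an answer with web search.
# Assumes the Responses API and the "web_search_preview" tool are available
# on the account; the model name and prompt are hypothetical examples.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # allow the model to consult the web
    input="What did OpenAI's o3 score on the PersonQA benchmark? Cite a source.",
)

# The reply can now be backed by retrieved pages rather than parametric memory,
# which is the mechanism the article credits for GPT-4o's SimpleQA accuracy.
print(response.output_text)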


If scaling up reasoning models does indeed continue to worsen hallucinations, this will make the hunt for a solution all the more urgent.

"Addressing hallucinations across all our models is an ongoing area of research, and we are continually working to improve their accuracy and reliability," said OpenAI spokesperson Niko Felix in an email to WAN.

Over the past year, the broader AI industry has pivoted to focus on reasoning models after techniques for improving traditional AI models began showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of compute and data during training. Yet it seems reasoning may also lead to more hallucination, presenting a challenge.
