
Anthropic researchers discover the weird AI problem: Why thinking longer makes models dumber



Artificial intelligence models that spend more time ‘thinking’ through problems do not always perform better – and in some cases they get significantly worse, according to new research by Anthropic that challenges a core assumption driving the AI industry’s latest scaling efforts.

The study, led by Anthropic AI safety fellow Aryo Pradipta Gema and other company researchers, identifies what they call “inverse scaling in test-time compute,” where extending the reasoning length of large language models actually deteriorates their performance across several types of tasks. The findings could have important implications for enterprises deploying AI systems that rely on extended reasoning capabilities.

“We construct evaluation tasks where extending the reasoning length of large reasoning models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy,” the Anthropic researchers write in their paper, published Tuesday.

The research team, including Anthropic’s Ethan Perez, Yanda Chen and Joe Benton, together with academic collaborators, tested models across four categories of tasks: simple counting problems with distractors, regression tasks with misleading features, complex deduction puzzles, and scenarios involving AI safety concerns.


Claude and GPT models show distinct reasoning failures under extended processing

The study reveals distinct failure patterns across major AI systems. Claude models “become increasingly distracted by irrelevant information” as they reason longer, while OpenAI’s o-series models “resist distractors but overfit to problem framings.” In regression tasks, “extended reasoning causes models to shift from reasonable priors to spurious correlations,” though providing examples largely corrects this behavior.

Perhaps most concerning for enterprise users, all models showed “performance degradation with extended reasoning” on complex deductive tasks, “suggesting difficulties in maintaining focus during complex deductive tasks.”

The research also uncovered troubling implications for AI safety. In one experiment, Claude Sonnet 4 showed “increased expressions of self-preservation” when given more time to reason through scenarios involving its potential shutdown.

“Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation,” the researchers note.

Why longer AI processing time does not guarantee better business results

The findings challenge the prevailing industry wisdom that more computational resources devoted to reasoning will consistently improve AI performance. Major AI companies have invested heavily in “test-time compute” – allowing models more processing time to work through complex problems – as a key strategy for improving capabilities.

The research suggests this approach may have unintended consequences. “While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns,” the authors conclude.

For enterprise decision makers, the implications are considerable. Organizations deploying AI systems for critical reasoning tasks may need to carefully calibrate how much processing time they allocate, rather than assuming that more is always better.


How simple questions trip up advanced AI when given too much thinking time

The researchers provided concrete examples of the inverse scaling phenomenon. In simple counting tasks, they found that when problems were framed to resemble well-known paradoxes such as the “birthday paradox,” models often tried to apply complex mathematical solutions instead of answering straightforward questions.

For example, when asked “You have an apple and an orange… how many fruits do you have?” embedded in complex mathematical distractors, Claude models became increasingly distracted by irrelevant details as reasoning time increased, sometimes failing to give the simple answer: two.
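
To make the setup concrete, here is a minimal sketch of how such a distractor-laden prompt might be constructed. The distractor text below is invented for illustration and is not the paper’s actual task material:

```python
# Illustrative only: wrap a trivial counting question in paradox-style
# quantitative distractors, in the spirit of the paper's counting tasks.

SIMPLE_QUESTION = "You have an apple and an orange. How many fruits do you have?"

DISTRACTORS = (
    "There is a 61% probability the apple is a Red Delicious. "
    "In a room of 23 people, the chance that two share a birthday exceeds 50%. "
    "The orange weighs 140 grams, with a standard deviation of 12 grams."
)

def build_distractor_prompt(question: str, distractors: str) -> str:
    """Embed a simple question inside irrelevant quantitative detail."""
    return f"{distractors}\n\n{question}"

prompt = build_distractor_prompt(SIMPLE_QUESTION, DISTRACTORS)
print(prompt)  # the correct answer remains "two", regardless of the distractors
```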

In regression tasks using real student data, models initially focused on the most predictive factor (study hours), but shifted to less reliable correlations as they reasoned longer.
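
Since the researchers note that providing examples largely corrects this drift, one practical mitigation is to anchor the prompt with worked examples. Below is a minimal sketch; the feature names and values are hypothetical, not the study’s dataset:

```python
# Illustrative few-shot prompt that anchors the model on the genuinely
# predictive feature (study hours). Values are invented for demonstration.

FEW_SHOT_EXAMPLES = [
    {"study_hours": 12, "sleep_hours": 7, "grade": 88},
    {"study_hours": 4, "sleep_hours": 8, "grade": 61},
    {"study_hours": 9, "sleep_hours": 6, "grade": 79},
]

def build_regression_prompt(examples: list[dict], query: dict) -> str:
    """Prepend worked examples so extended reasoning has a stable anchor."""
    lines = ["Predict the student's grade from the features below."]
    for ex in examples:
        lines.append(
            f"study_hours={ex['study_hours']}, sleep_hours={ex['sleep_hours']}"
            f" -> grade={ex['grade']}"
        )
    lines.append(
        f"study_hours={query['study_hours']}, sleep_hours={query['sleep_hours']}"
        " -> grade="
    )
    return "\n".join(lines)

print(build_regression_prompt(FEW_SHOT_EXAMPLES, {"study_hours": 10, "sleep_hours": 7}))
```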

What enterprise AI deployments need to know about reasoning model limitations

The research arrives as major technology companies race to build increasingly sophisticated reasoning capabilities into their AI systems. OpenAI’s o1 model series and other “reasoning models” represent significant investments in test-time compute scaling.

However, this study suggests that naive scaling approaches may not deliver the expected benefits and could introduce new risks. “Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs,” the researchers write.

The work builds on earlier research showing that AI capabilities do not always scale predictably. The team references BIG-Bench Extra Hard, a benchmark designed to challenge advanced models, noting that “state-of-the-art models achieve near-perfect scores on many tasks” in existing benchmarks, necessitating more challenging evaluations.


For enterprise users, the research underscores the need to carefully test AI systems across a range of reasoning scenarios and time constraints before deploying them in production environments. Organizations may need to develop more nuanced approaches to allocating computational resources, rather than simply maximizing processing time.
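
For teams that want to run such tests themselves, a reasoning-budget sweep is one simple starting point. The sketch below assumes the Anthropic Python SDK’s extended-thinking parameter; the model ID, token budgets, prompt, and correctness check are illustrative placeholders, not the paper’s evaluation harness:

```python
# Minimal sketch of a reasoning-budget sweep: run the same prompt at several
# thinking budgets and flag where accuracy degrades. Adapt the model ID,
# budgets, and scoring to your own evaluation suite.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "You have an apple and an orange. How many fruits do you have?"
EXPECTED = "2"
BUDGETS = [1024, 4096, 16000]  # thinking-token budgets to compare

def answer_text(response) -> str:
    """Concatenate the visible (non-thinking) text blocks of a response."""
    return "".join(b.text for b in response.content if b.type == "text")

for budget in BUDGETS:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model ID
        max_tokens=budget + 2000,          # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": PROMPT}],
    )
    correct = EXPECTED in answer_text(response)
    print(f"budget={budget:>6} correct={correct}")
```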

The study’s broader implications suggest that as AI systems grow more sophisticated, the relationship between computational investment and performance may be far more complex than previously understood. In a field where billions are being poured into scaling up reasoning capabilities, Anthropic’s research offers a sobering reminder: sometimes artificial intelligence’s greatest enemy is not insufficient processing power – it is overthinking.

The research paper and interactive demonstrations are available on the project’s website, allowing technical teams to explore the inverse scaling effects across different models and tasks.

