Large reasoning models almost certainly can think


There has been a lot of fuss lately about the idea that large reasoning models (LRMs) are incapable of thinking. This is mainly due to a research article published by Apple, “The Illusion of Thinking,” in which Apple argues that LRMs must not be able to think; instead, they just perform pattern matching. The evidence they provide is that LRMs with Chain-of-Thought (CoT) reasoning are unable to continue computation using a predefined algorithm as the problem grows.
This is a fundamentally flawed argument. For example, if you ask a human who already knows the algorithm for solving the Tower of Hanoi to solve an instance with twenty disks, he or she would almost certainly fail. By that logic, we would have to conclude that humans cannot think either. However, this argument only shows that there is no evidence that LRMs cannot think. That alone certainly does not mean LRMs can think – just that we cannot be sure they do not.
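To make the scale of the problem concrete, here is a minimal sketch of the classic recursive Tower of Hanoi algorithm, showing why twenty disks defeats anyone who merely follows the rules step by step:

```python
# Tower of Hanoi: the algorithm is trivial to state, but the number of
# moves grows exponentially with the number of disks.
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the list of moves that solves an n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
    else:
        hanoi(n - 1, src, dst, aux, moves)  # park n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk
        hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks on top
    return moves

# A 20-disk puzzle needs 2**20 - 1 moves -- over a million.
print(len(hanoi(20)))  # 1048575
```

Knowing the algorithm is easy; reliably executing more than a million moves without a single error is what exceeds any human's working memory and patience.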
In this article I will make a stronger claim: LRMs can almost certainly think. I say ‘almost’ because there is always a chance that further research could surprise us. But I think my argument is quite convincing.
What is thinking?
Before we try to understand whether LRMs can think, we need to define what we mean by thinking. But first, we must make sure that humans can think according to that definition. We will consider thinking only in the context of problem solving, which is the point of contention.
1. Problem representation (frontal and parietal lobes)
When you think about a problem, your prefrontal cortex becomes involved. This region is responsible for working memory, attention, and executive functions – abilities that help you keep the problem in mind, break it down into subcomponents, and set goals. Your parietal cortex helps encode the symbolic structure for math or puzzle problems.
2. Mental simulation (working memory and inner speech)
This has two components: one is an auditory loop that allows you to talk to yourself – similar to generating CoT. The other is visual imagery, which allows you to visually manipulate objects. Geometry was so important to navigating the world that we developed specialized capabilities for it. The auditory part is linked to Broca’s area and the auditory cortex, both recycled from language centers. The visual cortex and parietal areas mainly control the visual component.
3. Pattern matching and retrieval (hippocampus and temporal lobes)
These actions depend on past experiences and stored knowledge from long-term memory:
- The hippocampus helps retrieve related memories and facts.
- The temporal lobes bring in semantic knowledge – meanings, rules, categories.
This is similar to how neural networks rely on their training to process the task.
4. Monitoring and Evaluation (Anterior Cingulate Cortex)
Our anterior cingulate cortex (ACC) checks for errors, conflicts or impasses – it’s where you notice contradictions or dead ends. This process is essentially based on matching patterns from previous experiences.
5. Insight or Reframing (Default Mode Network and Right Hemisphere)
When you’re stuck, your brain can switch to the default mode network – a more relaxed, internally focused network. This is the moment when you take a step back, let go of the current thread and sometimes ‘suddenly’ see a new angle (the classic ‘aha!’ moment).
This is similar to how DeepSeek-R1 was trained to do CoT reasoning without CoT examples in its training data. Remember that the brain is constantly learning as it processes data and solves problems.
LRMs, on the other hand, do not change based on real-world feedback during prediction or generation. But during DeepSeek-R1’s CoT training, learning did happen as the model tried to solve problems – essentially updating as it reasoned.
Similarities between CoT reasoning and biological thinking
An LRM does not have all of the faculties mentioned above. For example, it is very unlikely that an LRM performs much visual reasoning in its circuits, although some may happen. It certainly does not produce intermediate images during CoT generation.
Most people can build spatial models in their heads to solve problems. Does this mean we can conclude that LRMs cannot think? I don’t agree. Some people find it difficult to form spatial models of the concepts they are thinking about – a condition called aphantasia. People with this condition can think perfectly well. In fact, they go through life as if they lacked no skill whatsoever. Many of them are actually good at symbolic reasoning and quite good at math – often enough to compensate for their lack of visual reasoning. We might expect our neural network models to overcome this limitation as well.
If we take a more abstract view of the human thought process described earlier, we can mainly see the following:
1. Pattern matching is used for recalling lessons learned, representing problems, and monitoring and evaluating thought chains.
2. The working memory stores all intermediate steps.
3. Backtracking happens when the reasoner concludes that the current chain of thought is not going anywhere and returns to a reasonable earlier point.
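The backtracking component can be sketched with a classic toy problem. This is an illustrative example, not anything an LRM literally runs: a search places queens one row at a time, and whenever a line of choices dead-ends, it returns to the last viable choice and tries another – the same abandon-and-resume pattern described above.

```python
# Backtracking sketch: the n-queens puzzle. Explore one choice at a
# time; when a branch dead-ends, back up and try the next option.
def solve_queens(n, cols=()):
    """Return one solution as a tuple of column indices, or None."""
    row = len(cols)
    if row == n:
        return cols                      # every row filled: success
    for col in range(n):
        # A column is safe if no earlier queen shares it or a diagonal.
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(cols)):
            result = solve_queens(n, cols + (col,))
            if result is not None:
                return result            # this branch worked
        # otherwise: this choice led nowhere -- backtrack, try the next
    return None                          # no column works: dead end

print(solve_queens(4))  # (1, 3, 0, 2)
```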
Pattern matching in an LRM comes from its training. The whole point of training is to learn both knowledge of the world and the patterns for processing that knowledge effectively. Because an LRM is a layered network, the entire working memory must fit within one layer. The weights store the knowledge of the world and the patterns to follow, while the processing between layers applies those learned patterns, stored as model parameters.
Note that even with CoT, the entire text – the input, the CoT, and any output generated so far – must fit into each layer. Working memory is just one layer (in the case of the attention mechanism, this includes the KV cache).
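A toy sketch of this working memory, assuming a single attention head and random vectors standing in for real token embeddings, shows how the cached context grows by one entry per generated token and how every step re-attends to all of it:

```python
# Toy illustration: in autoregressive generation, each step attends
# over all previous tokens -- the cached keys/values are the model's
# "working memory" at that layer.
import numpy as np

rng = np.random.default_rng(0)
d = 4                      # embedding size of the toy model
kv_cache = []              # grows by one (key, value) pair per token

def attend(query, cache):
    """Single-head dot-product attention over the cached tokens."""
    keys = np.stack([k for k, _ in cache])
    values = np.stack([v for _, v in cache])
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over cached positions
    return weights @ values              # context vector for this step

for step in range(5):
    token_vec = rng.normal(size=d)       # stand-in for a new token's embedding
    kv_cache.append((token_vec, token_vec))
    context = attend(token_vec, kv_cache)

print(len(kv_cache))  # 5 -- one cached entry per generated token
```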
In fact, CoT is very similar to what we do when we talk to ourselves – which is almost all the time. We almost always verbalize our thoughts, and so does a CoT reasoner.
There is also good evidence that CoT reasoners can backtrack when a particular line of argument seems futile. This is essentially what the Apple researchers saw when they asked LRMs to solve larger instances of simple puzzles. The LRMs correctly recognized that solving the puzzles directly would not fit in their working memory, so they tried to find shortcuts, just as a human would. This is further evidence that LRMs think rather than blindly follow predefined patterns.
But why would a next-token predictor learn to think?
Neural networks of sufficient size can learn any computation, including thinking – and a system trained to predict the next word can learn to think as well. Let me explain further.
A common idea is that LRMs cannot think because they ultimately just predict the next token – that they are a ‘glorified autocomplete.’ This view is fundamentally flawed: not the part about LRMs being an autocomplete, but the assumption that an autocomplete does not need to think. In fact, next-word prediction is far from a limited representation of thought. On the contrary, it is the most general form of knowledge representation one can hope for. Let me explain.
When we want to represent some knowledge, we need a language or a system of symbolism to do so. Several formal languages exist that are very precise in what they can express. However, such languages are fundamentally limited in the types of knowledge they can represent.
For example, first-order predicate logic cannot represent the properties of all predicates that satisfy a given property, because it does not allow predicates over predicates.
Of course, there are higher-order predicate calculi that can represent predicates over predicates to arbitrary depth. But even they cannot express ideas that are imprecise or abstract in nature.
Natural language, however, has full expressive power: you can describe any concept at any level of detail or abstraction. You can even describe concepts about natural language using natural language itself. That makes it a strong candidate for knowledge representation.
The challenge, of course, is that this expressive richness makes it more difficult to process the information encoded in natural language. But we don’t necessarily need to understand how to do this manually; we can simply program the machine using data, through a process called training.
A next token prediction engine essentially calculates a probability distribution over the next token given a context of previous tokens. Any machine that wants to calculate this probability accurately must represent world knowledge in some form.
A simple example: consider the incomplete sentence, “The highest mountain peak in the world is Mount …” To predict the next word as ‘Everest,’ this knowledge must be stored somewhere in the model. If the task requires the model to compute an answer or solve a puzzle, the next-token predictor must emit CoT tokens to carry the logic forward.
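A minimal sketch of what ‘predicting the next token’ means computationally – the logits here are made up for illustration; in a real model that knowledge is distributed across billions of weights:

```python
# Next-token prediction: a model maps a context to a probability
# distribution over the vocabulary. The "knowledge" lives in whatever
# produces these scores (here, a hard-coded toy dictionary).
import math

# Hypothetical logits for the context
# "The highest mountain peak in the world is Mount ..."
logits = {"Everest": 9.0, "Kilimanjaro": 2.0, "Fuji": 1.0}

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores.values())                              # for stability
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
next_token = max(probs, key=probs.get)  # greedy decoding
print(next_token)  # Everest
```

Getting this distribution right for arbitrary contexts is exactly where world knowledge, and potentially reasoning, becomes necessary.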
This implies that even though it predicts one token at a time, the model must internally represent at least the next few tokens in its working memory – enough to ensure that it stays on the logical path.
If you think about it, humans also predict the next token – whether while speaking or while thinking in the inner voice. A perfect autocomplete system that always emits the correct tokens and produces correct answers would have to be omniscient. Of course, we will never reach that point, because not every answer is computable.
A parameterized model that can represent knowledge by tuning its parameters, and that can learn through data and reinforcement, can certainly learn to think.
Does it produce the effects of thought?
Ultimately, the test of thinking is a system’s ability to solve problems that require thinking. If a system can answer previously unseen questions that require some level of reasoning, it must have learned to think – or at least to reason – its way to the answer.
We know that proprietary LRMs perform very well on certain reasoning benchmarks. However, since some of these models may have been trained on the benchmark test sets, we will focus only on open-source models, for the sake of honesty and transparency.
We assess them against the following benchmarks:
As you can see, in some benchmarks LRMs are capable of solving a significant number of logic-based queries. While it is true that in many cases they still lag behind human performance, it is important to note that the human baseline often comes from individuals trained specifically on those benchmarks. In some cases, LRMs even perform better than the average untrained human.
Conclusion
Based on the benchmark results, the striking similarity between CoT reasoning and biological reasoning, and the theoretical insight that any system with sufficient representation capacity, sufficient training data, and sufficient computing power can perform any computable task – LRMs meet these criteria to a significant degree.
It is therefore reasonable to conclude that LRMs almost certainly have the ability to think.
Debasish Ray Chawdhuri is a senior chief engineer at Talentica Software and a Ph.D. candidate in cryptography at IIT Bombay.



