New 'Markovian Thinking' technique unlocks a path to million-token AI reasoning


Researchers at Mila have proposed a new technique that makes large language models (LLMs) much more efficient at performing complex reasoning. Called Markovian Thinking, the approach enables LLMs to perform long-term reasoning without incurring the prohibitive computational costs that currently limit such tasks.
The team’s implementation, an environment called Delethink, structures the reasoning chain into fixed-size chunks, breaking the scaling problem that plagues very long LLM responses. Initial estimates show that for a 1.5-billion-parameter model, this method could reduce training costs by more than two-thirds compared to standard approaches.
The quadratic curse of long-chain reasoning
For an LLM to solve a complex problem, it often needs to generate a long series of intermediate “thinking” tokens, known as chain-of-thought (CoT). In recent years, researchers have discovered that the use of reinforcement learning (RL) to train models to produce longer CoTs (also called LongCoT) has significantly improved their reasoning ability.
However, the standard method for this has a critical flaw: the AI’s “state” (the prompt plus all the reasoning tokens it has generated so far) grows with each new reasoning token. For modern transformer-based models, this means the computational cost explodes quadratically as the reasoning chain lengthens, making it prohibitively expensive to train models for highly complex tasks.
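To make the scaling difference concrete, here is a back-of-the-envelope sketch (not from the paper) that models attention cost as the number of prior tokens each new token must attend over: an ever-growing context sums to roughly n²/2, while a context capped at a fixed chunk size grows only linearly.

```python
def longcot_cost(total_tokens: int) -> int:
    """Attention cost when the context grows with every token:
    step t attends over t tokens, so the total is ~n^2/2."""
    return sum(t for t in range(1, total_tokens + 1))

def delethink_cost(total_tokens: int, chunk: int) -> int:
    """Attention cost when the context is capped at `chunk` tokens:
    each step attends over at most `chunk` tokens, so the total is ~n*chunk."""
    return sum(min(t, chunk) for t in range(1, total_tokens + 1))

# At a 96,000-token reasoning length with 8,000-token chunks, the capped
# context cuts total attention work by a factor of roughly six.
n, chunk = 96_000, 8_000
print(longcot_cost(n) / delethink_cost(n, chunk))
```

This toy model ignores the prompt itself and constant factors, but it captures why capping the context turns quadratic growth into linear growth.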
Most current attempts to control these costs focus on limiting the amount of thinking the model does, implicitly favoring shorter solutions or ending the process early. While these methods provide some relief, they still operate within the LongCoT framework and are thus fundamentally bound by its quadratic nature.
Instead of trying to tame this growth, Mila created an RL environment that avoids the quadratic problem altogether. As co-author Amirhossein Kazemnejad explained, the goal is to enable capabilities such as multi-week reasoning and scientific discovery. “That regime (and the RL required to enable such capabilities) is not supported by the current LongCoT paradigm due to the quadratic computational cost,” he said.
Think in chunks with Delethink
The researchers’ solution is a paradigm they call the “Markovian thinker,” in which the model reasons while keeping the size of the reasoning context window constant. The core idea is to change the RL setup to distinguish between “how long the model thinks” and “how much context it needs to process.” If done correctly, a Markovian thinker turns the quadratic growth problem into linear computation and fixed memory requirements for LLM reasoning.
The researchers put this paradigm into practice through Delethink, which forces the model to reason in a series of fixed-size chunks, such as 8,000 tokens at a time. Within each chunk, the model reasons as it normally would, using the classical attention mechanism. But when the chunk limit is reached, the environment resets the context, creating a new prompt containing the original query plus a short “carryover” from the previous chunk. For example, the carryover could consist of the last few tokens of the previous chunk of CoT or a summary of its key results.
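The chunked loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: `generate(prompt, max_tokens)` is a hypothetical stand-in for any LLM completion call, and the `"FINAL ANSWER"` stop marker is an assumption for the example.

```python
def delethink_trace(query: str, generate, chunk_size: int = 8_000,
                    carryover: int = 512, max_chunks: int = 16) -> str:
    """Illustrative Delethink-style reasoning loop.

    `generate(prompt, max_tokens)` is a hypothetical completion function;
    `carryover` is how many trailing characters of each chunk survive
    into the next prompt (the "textual Markovian state")."""
    full_trace = []
    carry = ""
    for _ in range(max_chunks):
        # Each chunk restarts from the ORIGINAL query plus the carryover,
        # so the context the model attends over stays a fixed size.
        prompt = query + "\n" + carry
        chunk = generate(prompt, max_tokens=chunk_size)
        full_trace.append(chunk)
        if "FINAL ANSWER" in chunk:  # hypothetical stop condition
            break
        # Keep only the tail of the chunk as the next carryover.
        carry = chunk[-carryover:]
    return "".join(full_trace)
```

The key property is that `prompt` never grows beyond the query plus a bounded carryover, no matter how many chunks the model thinks for.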
This recasting of the problem forces the model to learn how to embed a summary of its progress, a “textual Markovian state,” into this carryover so it can continue its reasoning in the next chunk. This design addresses a common concern: whether the model can remember important details from earlier steps.
According to Kazemnejad, the model learns what to remember. “With training… the model is forced to learn to continue the task-critical state,” he explained. He added crucial clarification for practical use: the original input prompt is not changed, including the documents or contextual data added to it. “Our approach focuses on the reasoning phase and does not change the question,” he said.
Delethink in action
To test their approach, the researchers trained R1-Distill-1.5B with Delethink on a dataset of competition-level math problems, then evaluated it against several benchmarks. The model was trained to reason for up to 24,000 tokens, but in fixed chunks of 8,000 tokens.
The researchers compared this to models trained with the standard LongCoT-RL method. Their findings indicate that the Delethink-trained model, reasoning for up to 24,000 tokens, matched or exceeded a LongCoT model trained with the same 24,000-token budget on math benchmarks. On other tasks, such as coding and PhD-level question answering, Delethink also matched or slightly beat its LongCoT counterpart. “Overall, these results indicate that Delethink uses its thinking tokens as effectively as LongCoT-RL with less computing power,” the researchers write.
The benefits become even clearer when scaling beyond the training budget. While models trained with LongCoT quickly plateaued at their training limit, the Delethink-trained model continued to improve its performance. For example, some math problems were only solved after the model had reasoned through 140,000 tokens, far exceeding the training budget of 24,000 tokens. This linear-compute benefit is significant for enterprise applications. The researchers estimated that training a model to an average thinking length of 96,000 tokens would take 27 H100-GPU-months with LongCoT, compared to just 7 with Delethink.
This efficiency extends directly to inference, the key operating expense for most enterprises. “Models trained with Markovian Thinking use the same inference style (Delethink tracing) at test time, which provides the same linear-compute and constant-memory benefits after training,” says Kazemnejad. He gave a practical example: an AI agent could “debug a large codebase and think long and hard… which obviously reduces costs significantly compared to the conventional LongCoT approach.”
Interestingly, the researchers found that off-the-shelf reasoning models, even without any specific training, already show some ability to think in a Markovian way. This finding has immediate practical implications for developers. “In practice, this means that these models – without Delethink-RL – can already run a Delethink tracing wrapper and perform competitively with LongCoT on our benchmarked tasks,” Kazemnejad said.
Their experiments with larger models such as GPT OSS 120B showed robust performance with Delethink on a range of complex tasks. This latent ability provides a strong starting point for RL training and helps explain why the method is so effective. “Together, these results suggest that Delethink is compatible and scalable with state-of-the-art models,” the researchers concluded.
The success of Markovian Thinking suggests that “the next generation of reasoning models can think for millions of tokens,” the researchers note. This opens the door to fundamentally new AI capabilities that go beyond current limitations.
“Markovian thinking… opens the way for models that can ‘think’ over very long horizons, which we consider a necessary step towards eventual scientific discovery,” Kazemnejad said. “Our approach removes a major bottleneck and can enable training for tasks with much longer horizons, enabling next-generation capabilities.”




