Nvidia researchers boost LLMs' reasoning skills by getting them to 'think' during pre-training


Researchers at Nvidia have developed a new technique that flips the script on how large language models (LLMs) learn to reason.
The method, called reinforcement learning pre-training (RLP), integrates RL into the initial training phase rather than saving it for the end.
This approach encourages the model to “think for itself before predicting what comes next, thus learning independent thinking behavior earlier in pre-training,” the researchers write in their paper.
By learning to reason from plain text without the need for external verifiers, models trained with RLP show significant improvements in learning complex reasoning tasks downstream, hinting at a future of more capable and adaptable AI for real-world tasks.
The typical LLM training cycle
Typically, large language models are first pre-trained on vast amounts of text using a next-token prediction objective, where they are given a sequence of text and asked to repeatedly predict the next word (or token). In this stage they learn grammar, facts and basic associations.
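The next-token objective described above can be illustrated with a minimal sketch: the model assigns scores (logits) to every token in its vocabulary, and the training loss is the negative log-probability of the true next token. The `softmax` and `next_token_loss` helpers and the tiny four-token vocabulary below are illustrative assumptions, not part of the paper.

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy loss: -log p(true next token | context)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Toy example: given some context, the model scores 4 candidate tokens.
logits = [2.0, 0.5, 0.1, -1.0]
loss = next_token_loss(logits, target_index=0)  # true next token is index 0
```

A lower loss means the model assigned higher probability to the token that actually came next; pre-training minimizes this loss over billions of tokens.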
In a later post-training phase, models typically acquire complex reasoning skills such as chain-of-thought (CoT), where a model explains its reasoning step by step. This phase often involves supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), both of which require specialized, curated datasets.
The authors of the paper argue that this sequential process does not correspond to human understanding, which “is not a linear token-by-token process, but rather a parallel integration of input with prior knowledge.” Existing pre-training methods lack this mechanism, hindering a model’s ability to develop deep reasoning from the start.
How reinforcement learning pre-training works
RLP reformulates this process by viewing CoT generation as an action that the model takes before predicting the next token. At each step, the model first generates an internal ‘thought’ or reasoning chain. It then predicts the next word in the text, using the original context supplemented with the new thought.
The model receives a reward based on how much its thought improved the accuracy of its prediction compared to a baseline that generated no thought (pure next-token prediction). This reward signal is computed automatically from the change in probability, eliminating the need for external verifiers or human-labeled data.
The reward is only positive if the generated thought helps the model better predict the next token. By rewarding thoughts based on their predictive benefit, RLP effectively teaches the model how to think usefully based on the same massive, unstructured data sets used for standard pre-training.
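The reward described above can be sketched as the gap between two log-probabilities of the true next token: one conditioned on the context plus the generated thought, one on the context alone. This is a minimal illustration under assumed interfaces; the `rlp_reward`, `logprob` and `toy_model` names are hypothetical, not from the paper.

```python
import math

def logprob(model, context, token):
    """Hypothetical helper: the model's log-probability of `token` given `context`."""
    return model(context)[token]

def rlp_reward(model, context, thought, next_token):
    # Score the true next token twice: once with the generated thought
    # appended to the context, once without it (the no-think baseline).
    with_thought = logprob(model, context + thought, next_token)
    baseline = logprob(model, context, next_token)
    # The reward is positive only when the thought improved the prediction.
    return with_thought - baseline

# Toy stand-in for a language model: it predicts the answer with higher
# probability when the context contains a helpful reasoning step.
def toy_model(context):
    if "because" in context:
        return {"answer": math.log(0.8)}
    return {"answer": math.log(0.5)}

reward = rlp_reward(toy_model, "Q: ... ", "because ...", "answer")
```

Because the reward is a difference of log-probabilities against the model's own no-think baseline, a useless or distracting thought yields a zero or negative signal, which is what lets RLP train on ordinary unstructured text without any external verifier.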
This continuous feedback loop allows the model to learn when a simple predictive guess is sufficient and when deeper reasoning is needed. As the researchers put it: “RLP is designed to shape thinking in base models by rewarding only those thoughts that measurably help predict the next token.”
However, this fundamental approach does not make subsequent refinement phases redundant. According to Bryan Catanzaro, VP of applied deep learning research at Nvidia and co-author of the paper, RLP is designed to complement, not replace, these crucial steps. “RLP is not intended to replace the later phases after training, such as supervised fine-tuning or reinforcement learning through human feedback,” Catanzaro told VentureBeat. “Those phases remain critical to refining model behavior… It’s really designed to increase the effectiveness of those later phases by giving the model an edge.”
RLP in action
In experiments with Qwen3-1.7B and Nemotron-Nano-12B, the Nvidia team tested RLP against a series of benchmarks for mathematical and scientific reasoning. The results show that models enhanced with RLP consistently outperformed their conventionally trained counterparts, with particularly strong gains on reasoning-heavy tasks.
For an enterprise, this improved reasoning could translate into more reliable results in multi-step workflows such as financial analysis or legal document summarization.
“RLP during pretraining encourages the model to think before predicting, allowing the model to internalize a more coherent reasoning style,” Catanzaro said. “This could help reduce subtle logic errors, especially in longer workflows.”
While emphasizing that RLP-trained models will still need the usual guardrails such as layers of verification, human oversight and consistency checks, Catanzaro said that “RLP gives you a stronger baseline.”
Importantly, the benefits of RLP compound rather than dissipate during subsequent refinement phases (catastrophic forgetting, where later training stages cause the model to lose previously learned skills and knowledge, is a common problem in LLM training). The RLP-trained model achieved an overall score 7-8% higher than the baseline after an identical post-training regimen. The researchers conclude that RLP “lays a robust foundation of reasoning that is not swept away by downstream tuning, but instead is compounded by post-training.”
The technique's efficiency is another important finding. On the Qwen3-1.7B model, RLP improved performance by 17% over standard continuous pre-training and also beat a similar technique called Reinforcement Pre-Training via prefix-matching rewards (RPT). The advantage held even when the base model was trained on 35 times more data to match the computational cost, confirming that the gains come from the method itself and not just more processing.
Furthermore, RLP demonstrates impressive scalability and versatility, successfully extracting reasoning signal from general web data, not just curated datasets. When applied to the hybrid Mamba-Transformer model Nemotron-Nano-12B, RLP achieved a 35% relative improvement over a heavily trained baseline while using only a fraction of the data.
While these results point toward a more efficient path to building high-performance models, Catanzaro sees the innovation as a fundamental shift in the learning process itself, rather than an immediate solution to high training costs.
“This research is exciting because it offers a shift in the way models absorb information during pretraining, leading to smarter learning,” he explains. “It would not replace large-scale pretraining, but provide a new creative method to build the best possible models.”
A new foundation for AI training
Ultimately, RLP points to a future where pre-training is no longer a monolithic process of predicting the next token. Instead, the next generation of models could be built on a hybrid of objectives, creating AI that learns to think more robustly from day one. Catanzaro offers a powerful analogy to frame this shift:
“Next-token prediction teaches a model what the world looks like; reinforcement-style objectives like RLP can teach it to think about what it sees,” he said. “The combination of these two objectives could help models develop deeper, more structured thinking much earlier in training… Tools like RLP can build on that foundation, making learning more active, more curious and even more efficient.”
There is still much to learn about the dynamics of reinforcement learning in the pre-training phase, but what seems clear is that “introducing exploration earlier in training opens up a new axis of scaling – not just in size, but also in the way models learn to reason,” Catanzaro said.




