Google’s new AI training method helps small models tackle complex reasoning


Researchers at Google Cloud and UCLA have proposed a novel reinforcement learning framework that significantly improves the ability of language models to learn highly challenging multi-step reasoning tasks. Supervised reinforcement learning (SRL) reframes problem solving as a sequence of logical “actions,” providing rich learning signals during training.
This approach allows smaller models to learn complex problems that were previously beyond the reach of other commonly used training techniques. Experiments show that SRL not only excels at mathematical reasoning, but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can lift smaller, cheaper models to higher reasoning capabilities.
The limits of current LLM reasoning training
Recent advances in training large language models (LLMs) for reasoning are largely due to reinforcement learning with verifiable rewards (RLVR), a method in which a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model’s ability to find a correct solution within a limited number of attempts, or ‘rollouts’. Because each rollout is computationally expensive, models cannot try indefinitely. The method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
This creates a critical learning bottleneck. In many multi-step reasoning problems, a model may solve several steps correctly but be derailed by a single error, leading to an incorrect final answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach that provides no granular feedback and offers only sparse rewards.
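The all-or-nothing dynamic can be seen in a minimal sketch of an outcome-only reward. The function below is illustrative, not the paper’s implementation: it scores a rollout purely on whether the final answer matches the reference, so a rollout that got nearly every step right is indistinguishable from one that got nothing right.

```python
def rlvr_reward(final_answer: str, reference: str) -> float:
    """All-or-nothing outcome reward: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

# A rollout that solved 9 of 10 steps correctly but slipped on the last
# one receives exactly the same reward as pure noise:
print(rlvr_reward("41", "42"))  # 0.0 -- no credit for partially correct work
print(rlvr_reward("42", "42"))  # 1.0
```

Because the reward carries no information about *where* the reasoning went wrong, the model has nothing to learn from on problems it almost never solves within budget.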
An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the complete reasoning process prepared by experts. While SFT can boost reasoning skills, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data rather than learning to generalize to problems beyond the examples it has seen). This problem is compounded by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the researchers note, these limitations “leave a crucial gap for training small open-source models to effectively learn difficult problems.”
How supervised reinforcement learning works
SRL introduces a framework that reframes problem solving as a ‘sequential decision-making process’, striking a balance between pure results-oriented RL and pure imitation learning. Rather than optimizing only for the final answer or forcing the model to imitate an expert’s entire thought process, SRL teaches the model to reproduce a series of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to those of an expert, while developing its own internal reasoning style.
In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action can be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution paths, which are then used to train a smaller model.
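The decomposition step can be sketched as follows. This is a hedged illustration of the idea, not the paper’s data pipeline: the function name `expert_to_steps` and the prompt format are assumptions. Each expert action becomes one training item in which the model must predict the next action given the problem and all preceding expert actions.

```python
def expert_to_steps(problem: str, expert_actions: list[str]) -> list[dict]:
    """Turn one expert demonstration into per-step training items:
    each item asks the model to predict the next expert action, given
    the problem statement plus all previous actions as context."""
    items = []
    for i, action in enumerate(expert_actions):
        items.append({
            "prompt": problem + "\n" + "\n".join(expert_actions[:i]),
            "target_action": action,
        })
    return items

steps = expert_to_steps(
    "Solve 2x + 3 = 7",
    ["Subtract 3 from both sides: 2x = 4",
     "Divide both sides by 2: x = 2"],
)
print(len(steps))  # 2 -- one training item per intermediate action
```

A single hard problem thus yields several supervised decision points instead of one sparse end-of-rollout signal.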
According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-of-the-road approach is critical to its effectiveness in real-world scenarios. “SRL sits in the middle: it reflects the structured flexibility of real-world problem solving, where there are multiple valid strategies, but also clear ideas about what ‘good reasoning’ looks like at every step,” Hsu told VentureBeat. “This makes SRL suitable for domains like data science automation or perhaps supply chain optimization – tasks that reward sound intermediate reasoning rather than merely final answers.”
During training, the model first generates an ‘inner monologue’ (its internal reasoning process, embedded in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action, so the model receives dense feedback even when the overall solution is imperfect.
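A minimal sketch of such a step-wise reward is shown below. The use of `difflib.SequenceMatcher` is an illustrative stand-in for whatever similarity metric the paper actually uses: the private monologue is stripped out, and only the committed action is scored against the expert’s action.

```python
import difflib
import re

def step_reward(model_output: str, expert_action: str) -> float:
    """Strip the model's private <think>...</think> monologue, then score
    the remaining action against the expert action with a string-similarity
    ratio in [0, 1]. The metric choice here is illustrative."""
    action = re.sub(r"<think>.*?</think>", "", model_output, flags=re.S).strip()
    return difflib.SequenceMatcher(None, action, expert_action).ratio()

# Exact action match scores 1.0 regardless of the monologue's content:
print(step_reward("<think>isolate x</think>2x = 4", "2x = 4"))  # 1.0
# A near-miss still earns partial credit instead of a flat zero:
print(step_reward("<think>isolate x</think>2x = 5", "2x = 4"))
```

Unlike the outcome-only reward, a mostly-correct step earns a graded, non-zero signal, which is exactly the dense feedback that RLVR’s all-or-nothing scheme lacks.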
SRL in action
The researchers’ experiments show that SRL significantly outperforms strong baselines on both challenging mathematical reasoning and agentic software engineering benchmarks. They also noted that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without simply lengthening the outputs.
For business leaders, performance improvements are only valuable if they don’t come with runaway costs. Hsu clarifies that SRL-trained models reason more efficiently. “The gains come from better reasoning quality and structure, not from verbosity,” he said. “In terms of efficiency, SRL-trained models are approximately equal to the base model in terms of token usage… while SRL is not designed to reduce inference costs, it achieves stronger reasoning performance without increasing it.”
For the math experiments, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult mathematics questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models such as DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance improvement over the other methods.
The team then extended SRL to agentic software engineering, an area critical to business automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was compared against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a task solve rate of 14.8%, a 74% relative improvement over the SFT-based model. This demonstrates SRL’s ability to train more competent AI agents for complex, real-world programming tasks.
A new standard for high-stakes AI?
The paper’s strongest results came from combining methods: first using SRL to teach basic reasoning, and then using RLVR to refine that skill. In their experiments, when the researchers used SRL as pre-training and applied RLVR during post-training, they observed an average increase of 3.7%, demonstrating a powerful curriculum learning strategy.
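The two-stage curriculum can be summarized in a small sketch. The stage functions below are hypothetical placeholders, not the paper’s code; the point is only the ordering: dense step-level supervision first, outcome-based refinement second.

```python
def train_srl(model_state: dict) -> dict:
    """Stage 1 (placeholder): reward similarity to expert actions at every step."""
    return {**model_state, "stages": model_state["stages"] + ["srl"]}

def train_rlvr(model_state: dict) -> dict:
    """Stage 2 (placeholder): reward only verified-correct final answers."""
    return {**model_state, "stages": model_state["stages"] + ["rlvr"]}

# SRL first teaches step-by-step reasoning; RLVR then sharpens it
# toward verified final answers.
model = train_rlvr(train_srl({"stages": []}))
print(model["stages"])  # ['srl', 'rlvr']
```

In this ordering, SRL supplies the learnable signal on problems the model cannot yet solve end-to-end, and RLVR then optimizes the final-answer accuracy that matters at deployment.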
This raises the question of whether this could become a new blueprint for building specialized AI.
“We view SRL as a strong foundation,” Hsu said. “In a sense, SRL provides a curriculum – teaching models to think and act step by step – before refining that behavior with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL phase, but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications.”
Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the way forward. “While high-quality expert pathways remain important,” he concluded, “we think the next big step will come from automating their generation and filtering – using strong teacher models or even self-improving student models to build new data.”