
Beyond math and coding: New RL framework helps train LLM agents for complex, real-world tasks

Researchers from the University of Science and Technology of China have developed a new reinforcement learning (RL) framework that helps train large language models (LLMs) for complex agentic tasks beyond well-defined problems such as mathematics and coding.

Their framework, Agent-R1, is compatible with popular RL algorithms and shows significant improvements on reasoning tasks that require multiple retrieval phases and multi-turn interactions with tools.

The framework is built on a redefinition of the RL paradigm, one that accounts for the dynamic nature of agentic applications, which require interaction with evolving environments under imperfect information. This framing is much closer to real-world conditions and could prove important for agentic tasks in enterprise environments.

Rethinking reinforcement learning for agents

RL has become a cornerstone of training LLMs for well-defined reasoning tasks. In areas such as math and coding, the model receives a clear signal: the answer is right or wrong. This makes it relatively easy to reward or punish its behavior.

But this approach breaks down on agentic tasks that require models to work in interactive environments, develop dynamic memories of conversations, reason in multiple steps, and respond to unpredictable feedback. Training agents with RL for these scenarios presents unique challenges, especially in multi-turn interactions, where designing effective rewards is complex and the trained agent often fails to generalize to the messy, unpredictable nature of real-world environments.

To address these challenges, researchers at the University of Science and Technology of China revisited the fundamental framework of RL, known as the Markov decision process (MDP). An MDP models decision making using four key components: a state space (the set of possible states an agent can find itself in); an action space (what the agent can do); a state transition probability (the state to which an action is likely to lead); and a reward function (whether the outcome is good or bad). The paper proposes extending this framework to make it more suitable for LLM agents.
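The four components above can be sketched as a small data structure. This is a minimal illustration of a classic MDP, not code from the Agent-R1 paper; all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MDP:
    states: set[str]                                    # state space
    actions: set[str]                                   # action space
    transition: Callable[[str, str], dict[str, float]]  # P(next state | state, action)
    reward: Callable[[str, str, str], float]            # r(state, action, next state)

# A toy two-state example: "submit" moves from "draft" to "done" with probability 0.8.
toy = MDP(
    states={"draft", "done"},
    actions={"submit", "wait"},
    transition=lambda s, a: {"done": 0.8, "draft": 0.2} if a == "submit" else {s: 1.0},
    reward=lambda s, a, s2: 1.0 if s2 == "done" else 0.0,
)

print(toy.transition("draft", "submit"))        # {'done': 0.8, 'draft': 0.2}
print(toy.reward("draft", "submit", "done"))    # 1.0
```

In the standard single-turn LLM setting, the "state" is just the token sequence generated so far; the extensions below change what goes into each of these four slots.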


In the new formulation, the state space is expanded to include not only the current state (the current set of tokens generated by the model), but also the entire history of interactions and environmental feedback. Actions are still fundamentally about generating text, but specific strings of text can now trigger external tools, such as an API call. State transitions become unpredictable or "stochastic," because the outcome depends not only on the tokens the model predicts, but also on the response of the environment, which depends on external factors. Finally, the reward system becomes more detailed and includes intermediate "process rewards" for successfully completing steps along the way, rather than just a single reward at the very end. This gives the agent more frequent and precise guidance during training.

This last bit is especially important and addresses the "sparse reward" problem that most RL frameworks face. When the agent receives a single reward signal based only on the final outcome, it does not learn from the right and wrong intermediate steps it took along the way. Process rewards solve this problem by providing feedback signals on those intermediate steps, making the learning process much more efficient.
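The difference between the two reward schemes can be made concrete with a short sketch. This is a hypothetical illustration, not the paper's reward function; the step bonus value and function names are assumptions.

```python
def outcome_reward(steps: list[bool], final_correct: bool) -> list[float]:
    # Sparse: zero for every intermediate step, one signal at the very end.
    return [0.0] * (len(steps) - 1) + [1.0 if final_correct else 0.0]

def process_reward(steps: list[bool], final_correct: bool,
                   step_bonus: float = 0.2) -> list[float]:
    # Dense: each successful intermediate step (e.g. a useful retrieval or
    # a valid tool call) earns a small bonus; the final answer still
    # carries the largest signal.
    rewards = [step_bonus if ok else 0.0 for ok in steps[:-1]]
    rewards.append(1.0 if final_correct else 0.0)
    return rewards

# Three intermediate steps succeed, but the final answer is wrong:
trajectory = [True, True, True, False]
print(outcome_reward(trajectory, final_correct=False))  # [0.0, 0.0, 0.0, 0.0]
print(process_reward(trajectory, final_correct=False))  # [0.2, 0.2, 0.2, 0.0]
```

Under the sparse scheme the agent gets no credit for the three correct retrievals; under the process scheme, those steps are still reinforced even though the final answer failed.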

“These extensions are crucial for enabling reinforcement learning algorithms to train advanced agents capable of complex, multi-step reasoning and interaction within dynamic environments,” the researchers write in their paper.

The Agent-R1 framework

Based on the expanded MDP definition, the researchers developed Agent-R1, a flexible and easy-to-use training platform for RL-based LLM agents. It extends traditional single-turn RL frameworks to handle the multi-turn, interactive nature of agentic tasks, enabling seamless integration with diverse environments.


The main difference lies in the rollout phase, where the agent generates responses. In single-turn RL, the model generates a response once. In multi-turn RL, the rollout involves a series of complex back-and-forth interactions with the environment.

Agent-R1 realizes this flexible multi-turn rollout with two core modules: Tool and ToolEnv. The Tool module acts as an executor for specific actions, such as calling an API or accessing a database. When a Tool is called, it performs its action and returns the immediate, raw result. The ToolEnv module, on the other hand, is the orchestrator and interpreter. It takes the output of the tool and determines how that outcome affects the agent's state and overall task progress. ToolEnv manages state transitions, calculates reward signals based on tool results, and packages the new state information for the agent.

Basically, when an action is completed, the Tool reports "what happened," while ToolEnv dictates "what this outcome means for the agent and the task."
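The division of labor between the two modules can be sketched as follows. This is a simplified illustration of the Tool/ToolEnv split as described above, not Agent-R1's actual API; the class interfaces, the 0.1 process reward, and the fake search tool are all assumptions.

```python
class Tool:
    """Executes one concrete action and reports 'what happened'."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def execute(self, **kwargs) -> str:
        return self.fn(**kwargs)  # raw result, no interpretation

class ToolEnv:
    """Interprets tool results: updates state, assigns process rewards."""
    def __init__(self, tools: dict):
        self.tools = tools
        self.history: list[str] = []  # state includes the full interaction history

    def step(self, tool_name: str, **kwargs):
        raw = self.tools[tool_name].execute(**kwargs)       # what happened
        self.history.append(f"{tool_name} -> {raw}")        # state transition
        reward = 0.1 if raw else 0.0                        # intermediate process reward
        observation = f"<tool_result>{raw}</tool_result>"   # packaged for the agent
        return observation, reward

# Toy usage: a fake search tool standing in for a retrieval API.
env = ToolEnv({"search": Tool("search", lambda query: f"doc about {query}")})
obs, r = env.step("search", query="multi-hop QA")
print(obs)  # <tool_result>doc about multi-hop QA</tool_result>
print(r)    # 0.1
```

Note that the Tool knows nothing about rewards or state; everything that gives the raw result meaning for training lives in ToolEnv.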

Agent-R1 in action

The researchers tested Agent-R1 on the challenging task of answering multi-hop questions, which requires complex reasoning, retrieving information across multiple documents, and multi-step decision making. They trained Qwen2.5-3B-Instruct on question-answering datasets and evaluated its performance on the HotpotQA and 2WikiMultihopQA datasets. They also tested it on the Musique dataset, which was outside the domain of the tasks the agent was trained on.

They compared different RL algorithms trained with Agent-R1 against two baselines: Naive RAG, a single-pass retrieval method where an LLM responds based on one set of retrieved documents, and Base Tool Call, which uses the model’s native function calling capabilities without specialized RL training.


The results showed that all RL-trained agents performed substantially better than both baselines. GRPO, an RL algorithm used in advanced reasoning models such as DeepSeek-R1, delivered the best overall performance.

“These results robustly validate the efficacy of Agent-R1 in training high-performance LLM agents via end-to-end RL, demonstrating consistent, substantial gains over baselines across diverse datasets and RL algorithms,” the researchers write.

These findings could be of great importance to enterprises, where there is a strong push to apply RL and reasoning beyond well-defined domains. A framework designed to handle messy, multi-turn interactions with users and dynamic environments can pave the way for new agents capable of solving complex problems in the real world.

“We hope that Agent-R1 provides a foundation for future work on scalable and unified RL training for agentic LLMs,” the researchers conclude.
