
New memory framework builds AI agents that can handle the real world's unpredictability

Researchers from the University of Illinois Urbana-Champaign and Google Cloud AI Research have developed a framework that lets large language model (LLM) agents organize their experiences into a memory bank, helping them become better at complex tasks over time.

The framework, called ReasoningBank, distills “generalizable reasoning strategies” from an agent’s successful and failed attempts to solve problems. The agent then uses this memory at inference time to avoid repeating past mistakes and to make better decisions when faced with new problems. The researchers show that, in combination with test-time scaling techniques, where an agent makes multiple attempts at a problem, ReasoningBank significantly improves the performance and efficiency of LLM agents.

Their findings show that ReasoningBank consistently outperforms classical memory mechanisms in web browsing and software engineering benchmarks, providing a practical path to building more adaptive and reliable AI agents for enterprise applications.

The challenge of LLM agent memory

Because LLM agents are deployed in applications that run for long periods of time, they are confronted with a continuous stream of tasks. One of the major limitations of today’s LLM agents is their failure to learn from this accumulated experience. By approaching each task in isolation, they inevitably repeat past mistakes, ignore valuable insights from related problems, and fail to develop skills that would make them more capable over time.

The solution to this limitation is to give agents some kind of memory. Previous attempts to give agents memory focused on storing past interactions for reuse by organizing information into various forms, from plain text to structured graphs. However, these approaches often fall short. Many use raw interaction logs or only store successful task examples. This means that they cannot distill transferable higher-level reasoning patterns and, crucially, they do not extract and use the valuable information from the agent’s failures. As the researchers note in their paper, “existing memory designs are often limited to passively tracking data rather than providing useful, generalizable guidance for future decisions.”


How ReasoningBank works

ReasoningBank is a memory framework designed to overcome these limitations. The central idea is to distill useful strategies and reasoning tips from past experiences into structured memory items that can be stored and reused.
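To make the idea of a "structured memory item" concrete, here is a minimal sketch in Python. The field names (title, description, content, from_failure) are illustrative assumptions, not taken from the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One distilled reasoning strategy; field names are illustrative."""
    title: str          # short name for the strategy
    description: str    # one-line summary
    content: str        # the generalizable lesson or preventive tip
    from_failure: bool  # whether it was distilled from a failed attempt

# A hypothetical memory bank with one item distilled from a failed attempt.
bank: list[MemoryItem] = [
    MemoryItem(
        title="Narrow broad searches",
        description="Refine a query before scanning results",
        content="If a search returns thousands of items, add category "
                "filters or more specific terms before iterating.",
        from_failure=True,
    )
]
```

Storing distilled strategies rather than raw interaction logs is what lets the memory transfer across tasks: the item above applies to any search-heavy task, not just the one that produced it.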

According to Jun Yan, a research scientist at Google and co-author of the paper, this marks a fundamental change in the way agents operate. “Traditional agents work statically: each task is processed separately,” Yan explains. “ReasoningBank changes this by turning every task experience (successful or failed) into a structured, reusable reasoning memory. As a result, the agent doesn’t start over with each customer; it remembers and adapts proven strategies from similar past cases.”

The framework takes both successful and failed experiences and turns them into a collection of useful strategies and preventive lessons. The agent judges success or failure with an LLM-as-judge scheme, avoiding the need for human labeling.
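An LLM-as-judge check can be sketched as a single prompt-and-parse step. This is a simplified assumption of how such a judge might work; `llm` is a hypothetical callable standing in for any model API that takes a prompt string and returns text:

```python
def judge_success(task: str, trajectory: str, llm) -> bool:
    """Label a trajectory as success/failure via an LLM judge (sketch).

    `llm` is a hypothetical callable: prompt string in, model text out.
    """
    prompt = (
        f"Task: {task}\n"
        f"Agent trajectory:\n{trajectory}\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    # Parse the judge's verdict; tolerate whitespace and casing.
    return llm(prompt).strip().upper().startswith("YES")
```

Because the judge is itself a model, no human needs to label each trajectory, which is what makes this loop practical at scale.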

Yan provides a practical example of this process in action. An agent tasked with finding Sony headphones might fail because the broad search returns more than 4,000 irrelevant products. “ReasoningBank will first try to find out why this approach failed,” Yan said. “It will then distill strategies such as ‘optimize search’ and ‘narrow products with category filtering’. These strategies will be extremely useful in successfully completing future similar tasks.”

The process works in a closed loop. When faced with a new task, the agent uses an embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are inserted into the agent’s system prompt, providing context for decision-making. Once the task is completed, the framework extracts insights from both successes and failures and creates new memory items. This new knowledge is then analyzed, distilled, and merged back into ReasoningBank, allowing the agent to continuously evolve and improve its capabilities.
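The retrieval half of this loop can be sketched as cosine-similarity search over memory embeddings followed by prompt injection. The toy 2-d vectors below stand in for a real embedding model, and the dictionary layout is an assumption for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, memories, k=2):
    """Return the top-k memories closest to the query embedding."""
    return sorted(memories, key=lambda m: cosine(query_vec, m["vec"]),
                  reverse=True)[:k]

def build_prompt(task, memories):
    """Inject retrieved strategies into the agent's system prompt."""
    tips = "\n".join(f"- {m['text']}" for m in memories)
    return f"Relevant strategies from past tasks:\n{tips}\n\nTask: {task}"

# Toy 2-d "embeddings" stand in for a real embedding model.
memories = [
    {"text": "Narrow broad searches with category filters.", "vec": [1.0, 0.0]},
    {"text": "Verify form fields before submitting.", "vec": [0.0, 1.0]},
]
top = retrieve([0.9, 0.1], memories, k=1)
prompt = build_prompt("Find Sony headphones", top)
```

A query about product search (embedded near `[0.9, 0.1]`) retrieves the search-narrowing strategy rather than the form-filling one, so only relevant past lessons enter the prompt.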


Supercharging memory with test-time scaling

The researchers found a powerful synergy between memory and test-time scaling. Classic test-time scaling involves generating multiple independent answers to the same question, but the researchers argue that this “vanilla form is suboptimal because it does not take advantage of the inherent contrastive signal that comes from redundant exploration of the same problem.”

To address this, they propose Memory-aware Test-Time Scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. With parallel scaling, the system generates multiple trajectories for the same question, then compares and contrasts them to identify consistent reasoning patterns. With sequential scaling, the agent iteratively refines its reasoning within a single attempt, with its intermediate notes and corrections also serving as valuable memory cues.
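The parallel variant can be sketched in a few lines. Both `run_agent` and `contrast` are hypothetical LLM-backed callables, stand-ins for components the article does not specify; here `contrast` is simplified to majority agreement across trajectories:

```python
def parallel_matts(task, run_agent, contrast, n=3):
    """Parallel memory-aware test-time scaling (sketch).

    `run_agent` executes one attempt and returns a trajectory;
    `contrast` compares the trajectories and distills the strategy
    that is consistent across them. Both are hypothetical callables.
    """
    trajectories = [run_agent(task) for _ in range(n)]
    return contrast(trajectories)

# Demo: three attempts, two of which agree on the same strategy.
runs = iter(["filter by brand", "filter by brand", "scroll all pages"])
strategy = parallel_matts(
    "find headphones",
    run_agent=lambda t: next(runs),
    contrast=lambda trajs: max(set(trajs), key=trajs.count),  # majority vote
)
```

The contrastive step is the point: redundant exploration of one problem exposes which reasoning patterns recur across attempts, and those recurring patterns become the high-quality memories.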

This creates a virtuous cycle: the existing memory in ReasoningBank directs the agent to more promising solutions, while the diverse experiences generated by scaling allow the agent to create higher quality memories to store in ReasoningBank.

“This positive feedback loop positions memory-driven experience scaling as a new scaling dimension for agents,” the researchers write.

ReasoningBank in action

The researchers tested their framework on the WebArena (web browsing) and SWE-Bench-Verified (software engineering) benchmarks, using models such as Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet. They compared ReasoningBank against baselines including memory-free agents and agents using trajectory-based or workflow-based memory frameworks.

The results show that ReasoningBank consistently outperforms these baselines across all datasets and LLM backbones. On WebArena, it improved the overall success rate by up to 8.3 percentage points compared to a memory-free agent. It also generalized better on more difficult, cross-domain tasks, while reducing the number of interaction steps required to complete tasks. When combined with MaTTS, both parallel and sequential scaling further improved performance, consistently outperforming standard test time scaling.


This efficiency gain has a direct impact on operating costs. Yan points to a case where a memory-free agent took eight steps of trial and error to find the right product filter on a website. “These costs of trial and error can be avoided by leveraging relevant insights from ReasoningBank,” he noted. “In this case, we save almost double the operational costs,” which also improves the user experience by resolving issues faster.

For enterprises, ReasoningBank can help develop cost-effective agents that learn from experience and adapt over time in complex workflows and in areas such as software development, customer support, and data analytics. As the paper concludes, “Our findings suggest a practical path for building adaptive and lifelong learning resources.”

Yan confirmed that their findings point to a future of truly compositional intelligence. For example, a coding agent can learn separate skills such as API integration and database management from separate tasks. “Over time, these modular skills become building blocks that the agent can flexibly combine to solve more complex tasks,” he said, suggesting a future where agents can autonomously gather their knowledge to manage entire workflows with minimal human supervision.
