
Google’s ‘Nested Learning’ paradigm could solve AI’s memory and continual learning problem

Researchers at Google have developed a new AI paradigm aimed at solving one of the biggest limitations of today’s large language models: their inability to learn or update their knowledge after training. The paradigm, called Nested Learning, reformulates a model and its training not as a single process, but as a system of nested, multi-level optimization problems. The researchers claim that this approach can unlock more expressive learning algorithms, leading to better in-context learning and memory.

To prove their concept, the researchers used Nested Learning to develop a new model called Hope. Early experiments show that it delivers superior performance in language modeling, continual learning, and long-context reasoning tasks, potentially paving the way for efficient AI systems that can adapt to real-world environments.

The memory problem of large language models

Deep learning algorithms helped obviate the need for the careful engineering and domain expertise that traditional machine learning requires. Fed huge amounts of data, models could learn the necessary representations on their own. However, this approach came with its own challenges that could not be solved by simply stacking more layers or creating larger networks, such as generalizing to new data, continuously learning new tasks, and avoiding suboptimal solutions during training.

Efforts to overcome these challenges led to innovations such as Transformers, the basis of today’s large language models (LLMs). These models have “heralded a paradigm shift from task-specific models to more general-purpose systems with distinct emerging capabilities as a result of scaling the ‘right’ architectures,” the researchers write. Yet a fundamental limitation remains: LLMs are largely static after training and cannot update their core knowledge or acquire new skills through new interactions.


The only adaptable part of an LLM after training is its in-context learning ability, which lets it perform tasks based on the information in its current prompt. This makes current LLMs analogous to a person who cannot form new long-term memories. Their knowledge is limited to what they learned during pre-training (the distant past) and what is in their current context window (the immediate present). Once a conversation moves beyond the context window, that information is lost forever.

The problem is that current transformer-based LLMs have no mechanism for ‘online’ consolidation. Information in the context window never updates the model’s long-term parameters: the weights stored in the feed-forward layers. As a result, the model cannot permanently acquire new knowledge or skills from interactions; anything it learns disappears as soon as the context window rolls over.

A nested approach to learning

Nested Learning (NL) is designed to enable computer models to learn from data using different levels of abstraction and timescales, just like the brain. It treats a single machine learning model not as one continuous process, but as a system of interconnected learning problems that are optimized simultaneously and at different rates. This differs from the classical view, which treats the architecture of a model and the optimization algorithm as two separate components.

According to this paradigm, the training process is seen as developing “associative memory,” the ability to connect and recall related pieces of information. The model learns to map a data point to its local error, which measures how “surprising” that data point was. Even important architectural components such as the attention mechanism in transformers can be viewed as simple associative memory modules that learn mappings between tokens. By defining an update frequency for each component, these nested optimization problems can be classified into different ‘levels’, which are the core of the NL paradigm.
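The core idea of components updating at different frequencies can be sketched in a few lines of Python. This is a toy illustration only, not the paper’s actual formulation: the `Level` class, `update_every` parameter, and the “surprise” signal are all invented names for the concepts described above.

```python
# Illustrative sketch: a model treated as nested optimization "levels",
# each updating its parameters at its own frequency.

class Level:
    def __init__(self, name, update_every):
        self.name = name
        self.update_every = update_every  # update once per this many steps
        self.state = 0.0                  # stand-in for learnable parameters

    def maybe_update(self, step, signal):
        # Fast levels absorb immediate "surprise"; slow levels consolidate.
        if step % self.update_every == 0:
            self.state += signal / self.update_every
            return True
        return False

# Three levels on different timescales -- the nesting at the heart of NL.
levels = [Level("fast", 1), Level("medium", 10), Level("slow", 100)]

updates = {lvl.name: 0 for lvl in levels}
for step in range(1, 101):
    surprise = 1.0  # stand-in for the local error on a data point
    for lvl in levels:
        if lvl.maybe_update(step, surprise):
            updates[lvl.name] += 1

print(updates)  # the fast level updates every step, the slow one rarely
```

The point of the sketch is only the scheduling: in a real system each level would hold actual parameters and a real error signal, but the hierarchy of update frequencies is what defines the ‘levels’.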


Hope for continual learning

The researchers put these principles into practice with Hope, an architecture designed to embody Nested Learning. Hope is a modified version of Titans, another architecture that Google introduced in January to address the memory limitations of the transformer model. Although Titans had a powerful memory system, its parameters were updated at only two rates: one for a long-term memory module and one for a short-term memory mechanism.

Hope is a self-modifying architecture, complemented by a “Continuum Memory System” (CMS) that enables limitless levels of in-context learning and can scale to larger context windows. The CMS acts as a series of memory banks, each updated at a different frequency. Banks that update faster process instantaneous information, while slower banks consolidate more abstract knowledge over longer periods of time. This allows the model to optimize its own memory in a self-referential loop, creating an architecture with theoretically infinite levels of learning.
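A chain of memory banks consolidating at decreasing frequencies can be illustrated with a small sketch. This is a hypothetical structure for intuition, not the CMS design itself: the `Bank` class, the averaging step, and the chain wiring are assumptions made for the example.

```python
# Toy sketch of a CMS-style chain of memory banks: each bank updates
# less often and stores a coarser summary of the bank before it.

class Bank:
    def __init__(self, period):
        self.period = period   # consolidate every `period` inputs
        self.buffer = []       # recent items awaiting consolidation
        self.memory = []       # consolidated summaries

    def write(self, item):
        self.buffer.append(item)
        if len(self.buffer) == self.period:
            summary = sum(self.buffer) / self.period  # crude consolidation
            self.memory.append(summary)
            self.buffer.clear()
            return summary     # pass the summary down the chain
        return None            # nothing for the slower bank yet

# Chain: instantaneous -> medium -> abstract. Slower banks only ever
# see summaries produced by the faster banks upstream of them.
banks = [Bank(1), Bank(4), Bank(4)]

for t in range(16):
    item = float(t)
    for bank in banks:
        item = bank.write(item)
        if item is None:
            break

print([len(b.memory) for b in banks])  # fast bank holds many entries, slow bank few
```

In the sketch the slow banks end up with far fewer, more aggregated entries than the fast one, mirroring the article’s description of fast banks processing instantaneous information while slow banks consolidate abstract knowledge.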

Across a diverse set of language modeling and common-sense reasoning tasks, Hope demonstrated lower perplexity (a measure of how well a model predicts the next word in a sequence and maintains coherence in the text it generates) and higher accuracy compared to both standard transformers and other modern recurrent models. Hope also performed better on long-context “needle-in-haystack” tasks, which require a model to find and use a specific piece of information hidden in a large amount of text. This suggests that the CMS provides a more efficient way to process long strings of information.
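For reference on the metric mentioned above, perplexity is the exponential of the average negative log-probability a model assigns to the correct next tokens; lower is better. The probabilities below are made up purely to show the calculation.

```python
import math

def perplexity(token_probs):
    # token_probs: probability the model gave to each correct next token
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.9, 0.8, 0.95, 0.85]   # model usually predicts well
uncertain = [0.2, 0.1, 0.25, 0.15]   # model is often "surprised"

print(perplexity(confident))  # low perplexity
print(perplexity(uncertain))  # much higher perplexity
```

A model that assigned probability 0.5 to every correct token would score a perplexity of exactly 2, which gives a quick sanity check for the formula.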

This is one of many attempts to create AI systems that process information at different levels. The Hierarchical Reasoning Model (HRM) from Sapient Intelligence used a hierarchical architecture to make models more efficient at learning reasoning tasks. The Tiny Recursive Model (TRM), a model from Samsung, improves on HRM with architectural changes that boost performance while making it more efficient.


While promising, Nested Learning faces some of the same challenges as these other paradigms in realizing its full potential. Current AI hardware and software stacks are highly optimized for classic deep learning architectures and Transformer models in particular. Adopting Nested Learning at scale may require fundamental changes. However, if it gains momentum, it could lead to much more efficient LLMs that can continuously learn, a capability that is crucial for real-world business applications, where environments, data, and user needs are constantly evolving.

