ACE prevents context collapse with ‘evolving playbooks’ for self-improving AI agents


A new framework from Stanford University and SambaNova addresses a crucial challenge in building robust AI agents: context engineering. Called Agentic Context Engineering (ACE), the framework automatically populates and modifies the context window of large language model (LLM) applications by treating it as an “evolving playbook” that creates and refines strategies as the agent gains experience in its environment.
ACE is designed to overcome key limitations of other context engineering frameworks, preventing the model’s context from deteriorating as more information is collected. Experiments show that ACE works both for optimizing system prompts and managing agent memory, outperforming other methods while being significantly more efficient.
The challenge of context engineering
Advanced AI applications using LLMs rely heavily on “context adaptation,” or context engineering, to guide their behavior. Instead of the costly process of retraining or fine-tuning the model, developers use an LLM’s in-context learning ability to direct its behavior by customizing the input prompt with specific instructions, reasoning steps, or domain-specific knowledge. This additional information is usually obtained as the agent interacts with its environment and collects new data and experiences. The main goal of context engineering is to organize this new information in a way that improves model performance and avoids confusion. This approach is becoming a central paradigm for building capable, scalable, and self-improving AI systems.
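The core idea can be sketched in a few lines: instead of changing model weights, the application composes the prompt from fixed instructions plus knowledge accumulated at runtime. This is a minimal, hypothetical illustration (the function and string layout are assumptions, not part of ACE):

```python
# Context engineering in miniature: steer a model by assembling its prompt
# from instructions plus runtime-accumulated knowledge, not by retraining.

def build_prompt(instructions: str, learned_notes: list[str], user_query: str) -> str:
    """Compose a prompt from fixed instructions plus knowledge gathered at runtime."""
    notes = "\n".join(f"- {n}" for n in learned_notes)
    return (
        f"{instructions}\n\n"
        f"Lessons from past interactions:\n{notes}\n\n"
        f"Task: {user_query}"
    )

# Knowledge collected as the agent interacts with its environment:
learned_notes = [
    "The billing API rejects dates before 2020; validate inputs first.",
    "Customers asking about 'credits' usually mean promotional balance.",
]

prompt = build_prompt(
    instructions="You are a support agent. Answer concisely.",
    learned_notes=learned_notes,
    user_query="Why was my 2019 invoice rejected?",
)
```

The resulting string would be passed to any chat-completion API; the interesting engineering question, which ACE addresses, is how `learned_notes` is grown and maintained over time.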
Context engineering has several advantages for business applications. Contexts are interpretable by both users and developers, can be updated with new knowledge at runtime, and can be shared between different models. Context engineering also benefits from continued advances in hardware and software, such as the growing context windows of LLMs and efficient inference techniques such as prompt and context caching.
There are several automated context engineering techniques, but most suffer from two major limitations. The first is a “brevity bias,” where prompt optimization methods favor concise, general instructions over extensive, detailed ones. This can undermine performance in complex domains.
The second, more serious problem is “context collapse.” When an LLM is tasked with repeatedly rewriting its entire accumulated context, it can suffer from a kind of digital amnesia.
“What we call ‘context collapse’ happens when an AI tries to rewrite or compress everything it has learned into a single new version of its prompt or memory,” the researchers said in written comments to VentureBeat. “Over time, that rewriting process erases important details, such as rewriting a document so many times that important notes disappear. In customer-facing systems, this can mean a support agent suddenly loses awareness of previous interactions… causing erratic or inconsistent behavior.”
The researchers state that “contexts should function not as brief summaries, but as comprehensive, evolving playbooks – detailed, inclusive, and rich in domain insights.” This approach draws on the power of modern LLMs, which can effectively distill relevance from long and detailed contexts.
How Agentic Context Engineering (ACE) works
ACE is a comprehensive context adaptation framework designed for both offline tasks, such as system prompt optimization, and online scenarios, such as real-time memory updates for agents. Rather than compressing information, ACE treats context as a dynamic playbook that collects and organizes strategies over time.
The framework divides the work into three specialized roles: a Generator, a Reflector, and a Curator. This modular design is inspired by “the way humans learn — experimenting, reflecting, and consolidating — while avoiding the bottleneck of overloading a single model with all responsibilities,” according to the paper.
The workflow starts with the Generator, which produces reasoning trajectories for input prompts, highlighting both effective strategies and common mistakes. The Reflector then analyzes these trajectories to extract key lessons. Finally, the Curator synthesizes these lessons into compact updates and merges them into the existing playbook.
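The three-role loop can be sketched as follows. This is a hedged illustration of the division of labor, not the paper's implementation: in ACE each role would be backed by an LLM call, whereas here each is a plain placeholder function so the control flow is visible.

```python
# A sketch of ACE's Generator -> Reflector -> Curator loop. In the real
# framework each role is an LLM; here they are stub functions that show
# only the data flow between roles.

def generator(playbook: list[str], task: str) -> str:
    """Produce a reasoning trajectory for the task, guided by the current playbook."""
    return f"Trace for {task!r} using {len(playbook)} playbook bullets"

def reflector(trace: str) -> list[str]:
    """Extract lessons (effective strategies, common mistakes) from the trajectory."""
    return [f"Lesson from: {trace}"]

def curator(playbook: list[str], lessons: list[str]) -> list[str]:
    """Merge new lessons into the playbook as compact, incremental updates."""
    return playbook + [lesson for lesson in lessons if lesson not in playbook]

playbook: list[str] = []
for task in ["task-1", "task-2"]:
    trace = generator(playbook, task)
    lessons = reflector(trace)
    playbook = curator(playbook, lessons)
```

Note that the Curator only appends or merges; it never rewrites the whole playbook, which is what distinguishes this design from the monolithic rewriting that causes context collapse.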
To avoid context collapse and brevity bias, ACE incorporates two key design principles. First, it uses incremental updates. The context is represented as a collection of structured, itemized bullet points rather than a single block of text. This allows ACE to make granular changes and retrieve the most relevant information without rewriting the entire context.
Second, ACE uses a grow-and-refine mechanism. As new experiences are collected, new bullets are added to the playbook and existing ones are updated. A deduplication step regularly removes redundant entries, keeping the context comprehensive, yet relevant and compact, over time.
ACE in action
The researchers evaluated ACE on two types of tasks that take advantage of the changing context: agent benchmarks that require multi-turn reasoning and tool use, and domain-specific financial analysis benchmarks that require specialized knowledge. For high-stakes industries like finance, the benefits extend beyond pure performance. As the researchers said, the framework is “much more transparent: a compliance officer can literally read what the AI has learned because it is stored in human-readable text instead of hidden in billions of parameters.”
The results showed that ACE consistently outperformed strong baselines such as GEPA and classical in-context learning, achieving an average performance gain of 10.6% on agent tasks and 8.6% on domain-specific benchmarks, both offline and online.
Crucially, ACE can build effective contexts by analyzing feedback from its actions and environment, rather than requiring manually labeled data. The researchers note that this ability is a “key ingredient for self-improving LLMs and agents.” On the public AppWorld benchmark, designed to evaluate agentic systems, an agent using ACE with a smaller open-source model (DeepSeek-V3.1) matched the performance of the top-ranked, GPT-4.1-powered agent on average and exceeded it on the more difficult test split.
The takeaway for businesses is significant. “This means companies don’t have to rely on massive proprietary models to stay competitive,” the research team said. “They can deploy local models, protect sensitive data, and still achieve top results by continuously refining the context instead of retraining the weights.”
In addition to accuracy, ACE proved to be very efficient. It adapts to new tasks with an average of 86.9% lower latency than existing methods and requires fewer steps and tokens. The researchers point out that this efficiency shows that “scalable self-improvement can be achieved with both higher accuracy and less overhead.”
For companies concerned about inference costs, the researchers point out that the longer contexts ACE produces do not translate into proportionately higher costs. Modern serving infrastructures are increasingly optimized for long-context workloads, with techniques such as KV cache reuse, compression, and offloading offsetting the cost of processing long contexts.
Ultimately, ACE points to a future where AI systems are dynamic and constantly improving. “Today, only AI engineers can update models, but context engineering opens the door for domain experts (lawyers, analysts, physicians) to directly shape what the AI knows by editing the contextual playbook,” the researchers said. This also makes management more practical. “Selective unlearning becomes much more tractable: if a piece of information is outdated or legally sensitive, it can simply be removed or replaced in context, without retraining the model.”




