EAGLET boosts AI agent performance on longer-horizon tasks by generating custom plans

2025 was supposed to be the year of ‘AI agents’, according to Nvidia CEO Jensen Huang and other AI industry leaders. And in many ways it has been: numerous leading AI model providers, including OpenAI, Google, and Chinese competitors such as Alibaba, have released fine-tuned AI models or applications designed for a narrow set of tasks, such as web search and report writing.
But one major hurdle to a future of high-performance, reliable AI agents remains: keeping them on task when that task stretches across many steps. Third-party benchmarks show that even the most powerful AI models fail more often as the number of steps grows and as the time spent on a task stretches into hours.
A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents – without the need for manual data labeling or retraining.
Developed by researchers from Tsinghua University, Peking University, DeepLang AI and the University of Illinois Urbana-Champaign, EAGLET provides a ‘global planner’ that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.
At its core, EAGLET's planner is a language model that interprets task instructions (usually given as prompts by the user or the agent's operating environment) and generates a high-level plan for the executor agent (powered by its own LLM). The planner does not intervene during execution, but its upfront guidance helps reduce planning errors and improve task completion rates.
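The division of labor can be sketched in a few lines of code. This is a minimal, hypothetical illustration, not the paper's implementation: `call_llm` is a stub standing in for any chat-completion API, and the prompts and model names are invented. The key idea it shows is that the planner runs once up front and its plan is simply prepended to the executor's context at every step.

```python
# Minimal sketch of an EAGLET-style plan-then-execute loop.
# `call_llm` is a hypothetical stub standing in for any chat-completion API;
# the prompts and model names are illustrative, not from the paper.

def call_llm(model: str, prompt: str) -> str:
    """Stub: replace with a real LLM API call."""
    if "high-level plan" in prompt:
        return "1) survey the environment 2) gather what is needed 3) finish"
    return "done"

def generate_global_plan(task: str, planner_model: str = "eaglet-planner") -> str:
    """The planner runs once, up front, and never intervenes during execution."""
    return call_llm(planner_model, f"Write a high-level plan for this task:\n{task}")

def run_agent(task: str, executor_model: str, max_steps: int = 30) -> list[str]:
    plan = generate_global_plan(task)
    history: list[str] = []
    for _ in range(max_steps):
        # The global plan is prepended to the executor's context at each step.
        prompt = f"Task: {task}\nGlobal plan: {plan}\nHistory: {history}\nNext action:"
        action = call_llm(executor_model, prompt)
        history.append(action)
        if action.strip().lower() == "done":
            break
    return history
```

Because the plan is just extra context, any executor LLM can consume it unchanged, which is what makes the approach retraining-free.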
Addressing the planning problem in long-horizon agents
Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations and inefficient trajectories.
EAGLET addresses this limitation by introducing a global planning module that works alongside the executor agent.
Rather than combining planning and action generation into a single model, EAGLET separates them, allowing for more coherent strategies at the task level.
A two-stage training pipeline without human annotations
EAGLET’s planner is trained using a two-stage process that requires no human-written plans or annotations.
The first phase involves generating synthetic plans with high-capacity LLMs such as GPT-5 and DeepSeek-V3.1-Think.
These plans are then filtered using a new strategy called homologous consensus filtering, which retains only the plans that improve task performance for both expert and novice executors.
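In spirit, the filter amounts to a simple rule: keep a synthetic plan only if it raises task performance for every executor in a pool spanning capability levels. The sketch below is an assumption-laden reading of that idea; the scoring function, the strict-improvement criterion and the two-executor pool are illustrative, not the paper's exact procedure.

```python
from typing import Callable, Optional

# score(executor, task, plan) returns a task score; plan=None means "no-plan baseline".
Score = Callable[[str, str, Optional[str]], float]

def homologous_consensus_filter(
    plans: list[str],
    task: str,
    executors: list[str],  # e.g. a strong "expert" model and a weak "novice" model
    score: Score,
) -> list[str]:
    """Keep only plans that help EVERY executor beat its plan-free baseline."""
    kept = []
    for plan in plans:
        helps_all = all(
            score(ex, task, plan) > score(ex, task, None) for ex in executors
        )
        if helps_all:
            kept.append(plan)
    return kept
```

The consensus requirement is what discards plans that only flatter an already-capable model: a plan that boosts the expert but hurts the novice never makes it into the training set.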
In the second phase, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how well each plan helps multiple executors succeed.
Introducing the Executor Capability Gain Reward (ECGR)
One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).
This reward measures the value of a generated plan by checking whether it helps both high- and low-capability executors complete tasks more successfully and in fewer steps.
It also includes a decay factor to promote shorter, more efficient task paths. This approach avoids over-rewarding plans that are only useful to already competent agents and promotes more generalizable planning guidelines.
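As described, ECGR can be read as: average, over a pool of executors, the improvement a plan brings over each executor's plan-free baseline, with a per-step decay so that shorter successful trajectories earn more reward. The toy sketch below follows that reading; the exact reward form and decay schedule in the paper may differ.

```python
def ecgr(
    plan_results: list[tuple[bool, int]],      # (success, steps) with the plan, one per executor
    baseline_results: list[tuple[bool, int]],  # (success, steps) without the plan
    gamma: float = 0.9,                        # decay factor favoring shorter paths (assumed value)
) -> float:
    """Toy Executor Capability Gain Reward: mean gain in decayed success value."""
    def value(success: bool, steps: int) -> float:
        # A successful episode is worth gamma**steps, so fewer steps => more reward.
        return (gamma ** steps) if success else 0.0

    gains = [
        value(*with_plan) - value(*base)
        for with_plan, base in zip(plan_results, baseline_results)
    ]
    return sum(gains) / len(gains)
```

Subtracting each executor's baseline is what prevents over-rewarding plans that only help models which would have succeeded anyway, and the `gamma ** steps` decay directly encodes the preference for shorter task paths.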
Compatible with existing agents and models
The EAGLET planner is designed to be modular and plug-and-play, meaning it can be inserted into existing agent pipelines without retraining the executor.
In evaluations, the planner improved the performance of a variety of foundation models, including GPT-4.1, GPT-5, Llama-3.1 and Qwen2.5.
It also proved effective regardless of prompt strategy, working well with standard ReAct-style prompts and approaches such as Reflexion.
State-of-the-art performance across all three benchmarks
EAGLET was tested on three commonly used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based laboratory environment; ALFWorld, which tasks agents with completing housekeeping activities using natural language in a simulated home environment; and WebShop, which evaluates goal-oriented behavior in a realistic online shopping interface.
Across all three, EAGLET-equipped executors outperformed their plan-free counterparts as well as other planning baselines, including MPO and KnowAgent.
In experiments with the open source Llama-3.1-8B-Instruct model, EAGLET increased average performance from 39.5 to 59.4, a gain of +19.9 points across all tasks.
In ScienceWorld’s unseen scenarios, it increased performance from 42.2 to 61.6.
In ALFWorld's seen scenarios, EAGLET improved results from 22.9 to 54.3, a performance improvement of more than 2.3x.
Even stronger gains were made with more capable models.
For example, GPT-4.1 improved from 75.5 to 82.2 average score with EAGLET, and GPT-5 increased from 84.5 to 88.1, despite already being strong performers.
In some benchmarks, gains were as high as +11.8 points, such as when combining EAGLET with an ETO-trained executor on ALFWorld unseen tasks.
Compared to other planning baselines such as MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld unseen tasks with GPT-4.1, MPO scored 79.1 while EAGLET scored 83.6: a lead of +4.5 points.
Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as the executor, the average number of steps dropped from 13.0 (no planner) to 11.1 (with EAGLET). With GPT-5 it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.
Efficiency gains in training and execution
Compared to RL-based methods such as GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with about one-eighth of the training effort.
This efficiency extends to execution: agents using EAGLET typically required fewer steps to complete tasks. This translates into reduced inference time and computational costs in production scenarios.
No public code yet
As of the version submitted to arXiv, the authors have not released an open source implementation of EAGLET. It is unclear if and when the code will be released, under what license, or how it will be maintained, which may limit the framework’s usefulness for enterprise deployment in the near term.
VentureBeat has reached out to the authors to clarify these points and will update this piece as soon as we hear back.
There are still questions about enterprise implementation
Although the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or whether it will require a custom stack to keep planning and execution separate.
Similarly, the training setup uses multiple executor agents, which can be difficult to replicate in enterprise environments with limited model access. VentureBeat asked the researchers whether the homologous consensus filtering method could be adapted for teams with access to only a single executor model or limited computing resources.
The authors of EAGLET report success with different model types and sizes, but it is not yet known what the minimum feasible model scale is for practical implementation. For example, can business teams effectively run the planner with open models under 10B parameters in latency-sensitive environments? Additionally, the framework could provide industry-specific value in areas such as customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or adapted for those industries.
Real-time versus pre-generated planning
Another open question is how EAGLET is best used in practice. Should the planner operate in real time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost and operational complexity. VentureBeat asked the authors this question and will report any insights.
Strategic considerations for enterprise teams
For technical leaders at mid-to-large enterprises, EAGLET represents a compelling proof-of-concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Companies must weigh the potential gains in task execution and efficiency against the costs of reproducing or aligning the training process internally.
Potential use cases in corporate settings
For companies developing agentic AI systems – especially in environments that require step-by-step planning, such as IT automation, customer support or online interactions – EAGLET provides a template for integrating planning without retraining. Its ability to guide both open and closed source models, along with its efficient training method, can make it an attractive starting point for teams looking to improve agent performance with minimal overhead.




