Google’s new framework helps AI agents spend their compute and tool budget more wisely

In a new paper studying tool use in large language model (LLM) agents, researchers from Google and UC Santa Barbara have developed a framework that lets agents make more efficient use of their tool and compute budgets. The researchers introduce two techniques: a simple ‘Budget Tracker’ and a more comprehensive framework called ‘Budget Aware Test-time Scaling’ (BATS). Both make agents explicitly aware of their remaining reasoning and tool-call budget.

Because AI agents rely on tool calls to operate in the real world, test-time scaling becomes less about smarter models and more about controlling cost and latency.

For business leaders and developers, budget-conscious scaling techniques provide a practical way to deploy effective AI agents without dealing with unpredictable costs or diminishing returns on computing spend.

The challenge of scaling tool usage

Traditional test-time scaling focuses on making models ‘think’ longer. For agentic tasks such as web browsing, however, the number of tool calls directly determines the depth and breadth of exploration.

This introduces significant operational overhead for companies. “Tool calls such as browsing web pages result in more token consumption, increase context length and introduce additional time latency,” Zifeng Wang and Tengxiao Liu, co-authors of the paper, told VentureBeat. “Tool calls themselves incur additional API costs.”

The researchers found that simply allocating more test-time resources to agents does not guarantee better performance. “In a deep investigation task, if the agent has no sense of budget, it often goes down blind alleys,” Wang and Liu explained. “It finds one somewhat related clue, then spends 10 or 20 tool calls digging into it, only to realize the entire path was a dead end.”

Optimize resources with Budget Tracker

To evaluate how to optimize tool usage budgets, the researchers first tried a lightweight approach called ‘Budget Tracker’. This module acts as a plug-in that provides the agent with a continuous signal of resource availability, enabling budget-conscious tool use.

The team hypothesized that “providing explicit budget signals allows the model to internalize resource constraints and adjust strategy without the need for additional training.”

Budget Tracker works purely at the prompt level, and the paper provides the full prompts used, making it easy to implement.

In Google’s implementation, the tracker provides a brief policy guideline describing budget regimes and associated recommendations for tool use. At each step of the reasoning process, Budget Tracker makes the agent explicitly aware of its resource consumption and remaining budget, allowing it to decide its next reasoning steps based on the updated resource status.
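To illustrate the idea, here is a minimal sketch of a prompt-level budget signal. The regime thresholds, wording, and function name are hypothetical, not the paper's actual prompts:

```python
def budget_tracker_prompt(used_calls: int, total_calls: int) -> str:
    """Render a budget-status message appended to the agent's context
    at each step, so the model can adapt its strategy. Thresholds and
    regime wording are illustrative placeholders."""
    remaining = total_calls - used_calls
    ratio = remaining / total_calls
    if ratio > 0.5:
        regime = "ample: explore broadly and follow multiple leads"
    elif ratio > 0.2:
        regime = "moderate: prioritize only the most promising leads"
    else:
        regime = "scarce: stop exploring, consolidate and answer"
    return (
        f"[Budget Tracker] Tool calls used: {used_calls}/{total_calls} "
        f"({remaining} remaining). Budget regime is {regime}."
    )
```

Because the signal is just text injected into the prompt at each turn, it works with any model and requires no fine-tuning, which is the appeal of the approach.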

To test this, the researchers experimented with two paradigms: sequential scaling, where the model iteratively refines its output, and parallel scaling, where multiple independent runs are executed and aggregated. They conducted experiments with search agents equipped with search and browsing tools following a ReAct-like loop. ReAct (Reasoning + Acting) is a popular method in which the model alternates between internal thinking and external action. To capture the true cost-performance trend, they developed a unified cost metric that jointly accounts for internal token consumption and external tool interactions.
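A unified cost metric of this kind could be sketched as a weighted sum of token and tool costs. The prices below are placeholders, not the paper's actual weights:

```python
def unified_cost(prompt_tokens: int, completion_tokens: int,
                 tool_calls: int,
                 token_price: float = 1e-6,
                 tool_price: float = 0.01) -> float:
    """Illustrative unified cost in dollars: internal token consumption
    plus external tool interactions. Both prices are placeholders."""
    token_cost = (prompt_tokens + completion_tokens) * token_price
    tool_cost = tool_calls * tool_price
    return token_cost + tool_cost
```

Folding both terms into a single number is what lets the researchers compare, say, a long chain of cheap reasoning tokens against a short run with many expensive browsing calls on one axis.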

They tested Budget Tracker on three information-seeking QA datasets that require external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. The experiments show that this simple plugin improves performance under various budget constraints.

“Adding Budget Tracker achieves similar accuracy with 40.4% fewer searches, 19.9% fewer browsing calls, and a reduction in total costs… by 31.3%,” the authors told VentureBeat. Notably, performance with Budget Tracker continued to scale as the budget increased, while plain ReAct stagnated beyond a certain threshold.

BATS: A comprehensive framework for budget-conscious scaling

To further improve tool usage optimization, the researchers introduced Budget Aware Test-time Scaling (BATS), a framework designed to maximize agent performance within a given budget. BATS maintains a continuous signal of the remaining resources and uses this information to dynamically adjust the agent’s behavior as it formulates its response.

BATS uses multiple modules to orchestrate the agent’s actions. A planning module incrementally adjusts effort to match the current budget, while a verification module decides whether to “dig deeper” into a promising lead or “pivot” to alternate paths based on resource availability.

Given an information request and a tool-call budget, BATS starts by using the planning module to formulate a structured action plan and decide which tools to invoke. When tools are called, their responses are appended to the reasoning sequence, adding new evidence to the context. When the agent proposes a candidate answer, the verification module checks it and decides whether to continue the current line of inquiry or retry with the remaining budget.

The iterative process ends when budgeted resources are exhausted, after which an LLM-as-judge selects the best answer from all verified answers. During execution, the Budget Tracker continuously updates both resource usage and remaining budget at each iteration.
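The control flow described above can be sketched as a simple loop. All names here (`agent_step`, `verify`, `judge`) are hypothetical stand-ins for the paper's planning, verification, and LLM-as-judge modules:

```python
def bats_loop(question, budget, agent_step, verify, judge):
    """Plan and act until the agent proposes an answer, verify it,
    and repeat until the tool-call budget is exhausted; then let an
    LLM-as-judge pick the best verified answer."""
    verified, used, context = [], 0, [question]
    while used < budget:
        # agent_step plans with awareness of the remaining budget,
        # calls tools, and may propose a candidate answer; it reports
        # how many tool calls it consumed.
        candidate, calls_used = agent_step(context, budget - used)
        if calls_used == 0:  # guard against a step that makes no progress
            break
        used += calls_used
        if candidate is not None and verify(context, candidate):
            verified.append(candidate)
            context.append(candidate)  # keep verified findings in context
    return judge(verified) if verified else None
```

The key design choice, per the article, is that the budget signal threads through every stage: planning scales its ambition to the remaining budget, and verification uses it to arbitrate between digging deeper and pivoting.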

The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks with baselines including standard ReAct and several training-based agents. Their experiments show that BATS achieves higher performance, uses fewer tool calls, and has lower overall costs than competing methods. With Gemini 2.5 Pro as the backbone, BATS achieved an accuracy of 24.6% on BrowseComp compared to 12.6% for standard ReAct, and 27.0% on HLE-Search compared to 20.5% for ReAct.

BATS not only improves effectiveness under budget constraints, but also provides better cost-performance trade-offs. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of about 23 cents compared to a parallel scaling baseline that required more than 50 cents to achieve a similar result.

According to the authors, this efficiency makes previously expensive workflows viable. “This unlocks a range of data-intensive, long-horizon business applications… such as complex codebase maintenance, due diligence, competitive landscape research, compliance audits and multi-step document analysis,” they say.

As companies look to deploy agents that manage their own resources, the ability to balance accuracy and cost will become a critical design requirement.

“We believe that the relationship between reasoning and economics will become inseparable,” Wang and Liu said. “In the future, [models] must reason about their value.”
