Sakana AI’s TreeQuest: Deploy multi-model teams that outperform individual LLMs by 30%

Japanese AI lab Sakana AI has introduced a new technique that lets multiple large language models (LLMs) collaborate on a single task, effectively creating a "dream team" of AI agents. The method, called Multi-LLM AB-MCTS, enables models to perform trial and error and combine their unique strengths to solve problems that are too complex for any individual model.
For enterprises, this approach offers a way to develop more robust and capable AI systems. Instead of being locked into a single provider or model, companies can dynamically leverage the best aspects of different frontier models, assigning the right AI to the right part of a task to achieve superior results.
The power of collective intelligence
Frontier AI models are evolving rapidly. However, each model has distinct strengths and weaknesses derived from its unique training data and architecture. One might excel at coding, while another excels at creative writing. The Sakana AI researchers argue that these differences are not a bug, but a feature.
"We see these biases and varied aptitudes not as limitations, but as valuable resources for creating collective intelligence," the researchers state in their blog post. They believe that, just as humanity's greatest achievements come from diverse teams, AI systems can also achieve more by working together. "By pooling their intelligence, AI systems can solve problems that are insurmountable for any single model."
Thinking longer at inference time
Sakana AI's new algorithm is an "inference-time scaling" technique (also known as "test-time scaling"), an area of research that has become very popular in the past year. While most of the focus in AI has been on "training-time scaling" (making models bigger and training them on larger datasets), inference-time scaling improves performance by allocating more computational resources after a model has already been trained.
One common approach involves using reinforcement learning to prompt models to generate longer, more detailed chain-of-thought (CoT) sequences, as seen in popular models such as OpenAI o3 and DeepSeek-R1. Another, simpler method is repeated sampling, where the model is given the same prompt multiple times to generate a variety of potential solutions, similar to a brainstorming session. Sakana AI's work combines and extends these ideas.
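As an illustration, repeated sampling (Best-of-N) fits in a few lines of Python. This is a minimal sketch: `fake_llm` and `score` are hypothetical stand-ins for a real model API call and a task-specific evaluator, not part of any named library.

```python
import random

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return f"candidate-{random.randint(0, 9)}"

def score(answer: str) -> float:
    """Hypothetical task-specific evaluator (e.g. a unit-test pass rate)."""
    return int(answer.split("-")[1]) / 9.0

def best_of_n(prompt: str, n: int = 8) -> str:
    """Repeated sampling: query the model n times, keep the best answer."""
    candidates = [fake_llm(prompt) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("Solve the puzzle"))
```

The limitation this exposes is that every sample is independent: nothing learned from one attempt informs the next, which is exactly the gap AB-MCTS is designed to close.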
"Our framework offers a smarter, more strategic version of Best-of-N (aka repeated sampling)," Takuya Akiba, research scientist at Sakana AI and co-author of the paper, told VentureBeat. "It complements reasoning techniques like long CoT through RL. By dynamically selecting the search strategy and the appropriate LLM, this approach maximizes performance within a limited number of LLM calls, delivering better results on complex tasks."
How Adaptive Branching Search works
The core of the new method is an algorithm called Adaptive Branching Monte Carlo Tree Search (AB-MCTS). It enables an LLM to effectively perform trial and error by intelligently balancing two different search strategies: "searching deeper" and "searching wider." Searching deeper involves taking a promising answer and repeatedly refining it, while searching wider means generating completely new solutions from scratch. AB-MCTS combines these approaches, allowing the system to improve on a good idea but also to pivot and try something new if it hits a dead end or discovers another promising direction.
To accomplish this, the system uses Monte Carlo Tree Search (MCTS), a decision-making algorithm famously used by DeepMind's AlphaGo. At each step, AB-MCTS uses probability models to decide whether it is more strategic to refine an existing solution or generate a new one.

The researchers took this a step further with Multi-LLM AB-MCTS, which not only decides "what" to do (refine vs. generate) but also "which" LLM should do it. At the start of a task, the system does not know which model is best suited for the problem. It begins by trying a balanced mix of available LLMs and, as it progresses, learns which models are more effective, allocating more of the workload to them over time.
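This model-selection layer behaves like a multi-armed bandit. Below is a hedged sketch, again assuming Thompson sampling over per-model Beta posteriors; the model names and "true" success rates are invented for the demo and are not from the paper.

```python
import random

# Per-model Beta(alpha, beta) posteriors over the success probability.
stats = {"model_a": [1.0, 1.0], "model_b": [1.0, 1.0], "model_c": [1.0, 1.0]}

def pick_model() -> str:
    """Thompson sampling: draw from each posterior, pick the best draw."""
    return max(stats, key=lambda m: random.betavariate(*stats[m]))

def update(model: str, reward: float) -> None:
    """Fold the observed outcome back into that model's posterior."""
    stats[model][0] += reward          # alpha grows with successes
    stats[model][1] += 1.0 - reward    # beta grows with failures

# Toy run: model_b is secretly the strongest, so the bandit should
# gradually route more of the workload to it, just as described above.
true_rate = {"model_a": 0.2, "model_b": 0.8, "model_c": 0.4}
for _ in range(200):
    m = pick_model()
    update(m, 1.0 if random.random() < true_rate[m] else 0.0)
print({m: round(a / (a + b), 2) for m, (a, b) in stats.items()})
```

Early on, all three models are sampled roughly equally; as evidence accumulates, draws from the weaker models' posteriors rarely win, so their share of the budget shrinks without ever dropping to zero.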
Putting the AI 'dream team' to the test
The researchers tested their Multi-LLM AB-MCTS system on the ARC-AGI-2 benchmark. ARC (Abstraction and Reasoning Corpus) is designed to test a human-like ability to solve novel visual reasoning problems, making it notoriously difficult for AI.
The team used a combination of frontier models, including o4-mini, Gemini 2.5 Pro and DeepSeek-R1.
The collective of models was able to find correct solutions for over 30% of the 120 test problems, a score that significantly outperformed any of the models working alone. The system demonstrated the ability to dynamically assign the best model for a given problem. On tasks where a clear path to a solution existed, the algorithm quickly identified the most effective LLM and used it more frequently.

More impressively, the team observed instances where the models solved problems that were previously impossible for any single one of them. In one case, a solution generated by the o4-mini model was incorrect. However, the system passed this flawed attempt to DeepSeek-R1 and Gemini 2.5 Pro, which were able to analyze the error, correct it, and ultimately produce the right answer.
"This demonstrates that Multi-LLM AB-MCTS can flexibly combine frontier models to solve previously unsolvable problems, pushing the limits of what is achievable by using LLMs as a collective intelligence," the researchers write.

"In addition to each model's individual pros and cons, the tendency to hallucinate can vary significantly among them," Akiba said. "By creating an ensemble with a model that is less likely to hallucinate, it could be possible to achieve the best of both worlds: powerful logical capabilities and strong groundedness. Since hallucination is a major issue in a business context, this approach could be valuable for its mitigation."
From research to real applications
To help developers and businesses apply this technique, Sakana AI has released the underlying algorithm as an open-source framework called TreeQuest, available under an Apache 2.0 license (usable for commercial purposes). TreeQuest provides a flexible API, allowing users to implement Multi-LLM AB-MCTS for their own tasks with custom scoring and logic.
"While we are in the early stages of applying AB-MCTS to specific business-oriented problems, our research reveals significant potential in several areas," Akiba said.
Beyond the ARC-AGI-2 benchmark, the team has successfully applied AB-MCTS to tasks such as complex algorithmic coding and improving the accuracy of machine learning models.
"AB-MCTS could also be highly effective for problems that require iterative trial and error, such as optimizing performance metrics of existing software," Akiba said. "For example, it could be used to automatically find ways to improve the response latency of a web service."
The release of a practical, open-source tool can clear the way for a new class of more powerful and reliable Enterprise AI applications.