Has this stealth startup finally cracked the code on enterprise AI agent reliability? Meet AUI's Apollo-1


For more than ten years, conversational AI has promised human-like assistants that can do more than chat. But even now that large language models (LLMs) such as ChatGPT, Gemini and Claude have learned to reason, explain and code, one crucial category of interaction remains largely unsolved: reliably performing tasks for people beyond chat.
Even the best AI models score only around 30% on Terminal-Bench Hard, a third-party benchmark designed to evaluate how well AI agents complete a variety of terminal-based tasks, far below the reliability most companies and users require. Task-specific benchmarks fare little better: on τ-bench airline, which measures how reliably AI agents find and book flights on behalf of a user, the best-performing model (Claude 3.7 Sonnet) succeeds only 56% of the time, meaning the agent fails nearly half the time.
New York City-based Augmented Intelligence (AUI) Inc., founded by Ohad Elhelo and Ori Cohen, believes it has finally come up with a solution that raises the reliability of AI agents to a level at which most companies can trust them to do what they are instructed to do.
The company's new foundation model, called Apollo-1 (still in preview with early testers, but approaching general release), is built on a principle it calls stateful neuro-symbolic reasoning.
It is a hybrid architecture championed even by LLM skeptics such as Gary Marcus, designed to guarantee consistent, policy-compliant results in every customer interaction.
“Conversational AI consists essentially of two halves,” Elhelo said in a recent interview with VentureBeat. “The first half — open-ended dialogue — is handled beautifully by LLMs. They are designed for creative or exploratory use. The other half is task-oriented dialogue, where there is always a specific goal behind the conversation. That half has remained unsolved because it requires certainty.”
AUI defines certainty as the difference between an agent that “probably” performs a task and one that almost “always” does.
On τ-bench airline, for example, Apollo-1 achieves a 92.5% success rate, leaving all current competitors far behind, according to benchmarks shared with VentureBeat and posted on the AUI website.
Elhelo gave simple examples: a bank that must enforce identity verification for refunds over $200, or an airline that must always offer a business-class upgrade before booking economy class.
“Those are not preferences,” he said. “They are requirements. And no purely generative approach can offer that kind of behavioral certainty.”
AUI and its work on improving reliability were previously covered by the subscription news outlet The Information, but have not received broad attention in publicly accessible media so far.
From pattern recognition to predictable action
The team claims that transformer models cannot meet this bar by design. Large language models generate plausible text, not guaranteed behavior. “If you tell an LLM to always offer insurance before payment, it will usually do so,” said Elhelo. “Configure Apollo-1 with that rule, and it will always happen.”
That distinction, he says, stems from the architecture itself. Transformers predict the next token in a sequence. Apollo-1, by contrast, predicts the next action in a conversation, operating on what AUI calls a typed symbolic state.
Cohen explained the idea in more technical terms. “Neuro-symbolic means that we merge the two dominant paradigms,” he said. “The symbolic layer gives you structure — it knows what an intent, an entity and a parameter are — while the neural layer gives you language skills. The neuro-symbolic reasoner sits in between. It is a different kind of brain for dialogue.”
Where transformers treat every output as text generation, Apollo-1 runs a closed reasoning loop: an encoder translates natural language into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder turns the result into language. “The process is iterative,” says Cohen. “It continues until the task is complete. That is how you get determinism instead of probability.”
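The loop Cohen describes can be sketched in a few lines. This is a minimal, hypothetical illustration of the encode → decide → decode cycle over a typed symbolic state; every name here (`DialogueState`, `encode`, `decide`, `decode`) is an assumption for illustration, not AUI's actual API, and the keyword heuristics stand in for the neural components.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DialogueState:
    """Typed symbolic state: an intent plus named slots (entities, parameters)."""
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)
    done: bool = False

def encode(utterance: str, state: DialogueState) -> DialogueState:
    # Neural step (mocked): map natural language onto the symbolic state.
    if "book" in utterance:
        state.intent = "book_flight"
    for word in utterance.split():
        if word.isupper() and len(word) == 3:  # crude stand-in for airport-code extraction
            state.slots.setdefault("airports", []).append(word)
    return state

def decide(state: DialogueState) -> str:
    # Symbolic step: the next action is a deterministic function of the state.
    if state.intent == "book_flight" and len(state.slots.get("airports", [])) < 2:
        return "ask_destination"
    state.done = True
    return "confirm_booking"

def decode(action: str) -> str:
    # Neural step (mocked): turn the chosen action back into language.
    return {"ask_destination": "Where would you like to fly?",
            "confirm_booking": "Booking confirmed."}[action]

def turn(utterance: str, state: DialogueState) -> str:
    """One iteration of the closed loop; repeat until state.done."""
    state = encode(utterance, state)
    return decode(decide(state))

state = DialogueState()
print(turn("I want to book a flight from JFK", state))  # asks for the destination
print(turn("to LAX please", state))                     # confirms the booking
```

The point of the structure, as Cohen frames it, is that `decide` never samples: given the same symbolic state, it always returns the same action.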
A foundation model for task performance
Unlike traditional chatbots or custom automation systems, Apollo-1 is intended as a foundation for task-oriented dialogue: a single domain-independent system that can be configured for banking, travel, retail or insurance via what AUI calls a system prompt.
“The system prompt is not a configuration file,” said Elhelo. “It is a behavioral contract. You define exactly how your agent must behave in specific situations, and Apollo-1 guarantees that behavior will be carried out.”
Organizations can use the prompt to encode symbolic slots (intents, parameters and policies), as well as tool boundaries and state-dependent rules.
For example, a delivery app can enforce “if an allergy is mentioned, always inform the restaurant,” while a telecom provider could define “after three failed payment attempts, suspend the service.” In both cases the behavior is executed deterministically, not statistically.
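State-dependent rules of this kind are straightforward to make deterministic once the dialogue state is symbolic. The sketch below is an assumption about how such rules might be encoded and checked — the rule format and the `required_actions` engine are illustrative, not AUI's system-prompt syntax:

```python
# Each rule pairs a condition on the symbolic state with a mandatory action.
RULES = [
    (lambda s: s.get("allergy_mentioned", False),         "notify_restaurant"),
    (lambda s: s.get("failed_payment_attempts", 0) >= 3,  "suspend_service"),
]

def required_actions(state: dict) -> list:
    """Return every mandated action whose condition holds.

    Pure function of the state: the same state always yields
    the same actions, which is the deterministic guarantee
    the article describes.
    """
    return [action for condition, action in RULES if condition(state)]

print(required_actions({"allergy_mentioned": True}))       # ['notify_restaurant']
print(required_actions({"failed_payment_attempts": 3}))    # ['suspend_service']
```

Contrast this with prompting an LLM to follow the same rules, where compliance is probabilistic: here the rule either fires or it does not.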
Eight years in the making
AUI's road to Apollo-1 began in 2017, when the team started encoding millions of real task-oriented conversations handled by a workforce of 60,000 people.
That work led to a symbolic language that could separate procedural knowledge — steps, constraints and flows — from descriptive knowledge such as entities and attributes.
“The insight was that task-oriented dialogue has universal patterns,” says Elhelo. “Food delivery, claims processing and order management all share similar structures. Once you model them explicitly, you can reason deterministically over them.”
From there, the company built the neuro-symbolic reasoner: a system that uses the symbolic state to decide what happens next, instead of guessing via token prediction.
Benchmarks suggest that the architecture makes a measurable difference.
According to AUI’s own evaluations, Apollo-1 achieved more than 90 percent task completion on the τ-bench airline benchmark, compared to 60 percent for Claude 4.
It completed 83 percent of live booking chats on Google Flights versus 22 percent for Gemini 2.5 Flash, and 91 percent of retail scenarios on Amazon versus 17 percent for Rufus.
“These are not incremental improvements,” said Cohen. “They are order-of-magnitude gains in reliability.”
A complement, not a competitor
AUI does not present Apollo-1 as a replacement for large language models, but as their necessary counterpart. In Elhelo's words: “Transformers optimize for creative probability. Apollo-1 optimizes for behavioral certainty. Together they cover the entire spectrum of conversational AI.”
The model is already running in limited pilots with unnamed Fortune 500 companies in sectors such as finance, travel and retail.
AUI has also confirmed a strategic partnership with Google and plans for general availability in November 2025, when it will open up APIs, release full documentation, and add speech and image capabilities. Interested customers and partners can sign up for more information via a form on the AUI website.
Until then, the company is keeping the details under wraps. Asked what comes next, Elhelo smiled. “Let's just say we are preparing an announcement,” he said. “Soon.”
Toward conversations that take action
For all its technical sophistication, Apollo-1's pitch is simple: build AI that companies can trust to act, not just talk. “We are on a mission to democratize access to AI that works,” Cohen said at the end of the interview.
Whether Apollo-1 becomes the new standard for task-oriented dialogue remains to be seen. But if AUI's architecture performs as promised, the long-standing gap between chatbots that sound human and agents that reliably do human work may finally begin to close.




