
Together AI's ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time

Companies scaling their AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep up with shifting workloads.

Speculators are smaller AI models that work alongside large language models during inference. They prepare multiple tokens in advance, which the main model then verifies in parallel. This technique (called speculative decoding) has become essential for companies trying to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens at once, dramatically improving throughput.
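The draft-and-verify loop described above can be sketched in a few lines. This is a simplified illustration, not Together AI's implementation: the two models are stand-in callables that greedily return one next token, and the parallel verification pass is simulated sequentially.

```python
def speculative_decode_step(draft_model, target_model, prefix, k=5):
    """One step of speculative decoding (simplified sketch).

    draft_model and target_model are hypothetical callables that map a
    token prefix to the next token (greedy, for simplicity).
    """
    # 1. The small draft model proposes k tokens autoregressively.
    drafted = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. The target model checks every drafted position. On a GPU this
    #    is a single parallel pass; here it is simulated sequentially.
    accepted = []
    ctx = list(prefix)
    for tok in drafted:
        target_tok = target_model(ctx)
        if target_tok != tok:
            # First mismatch: keep the target's own token and stop.
            accepted.append(target_tok)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted  # between 1 and k tokens per target-model pass
```

When the draft agrees with the target, all k tokens land in one pass; when it disagrees, the system still makes progress by one token, so output quality never changes.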

Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System), which aims to help companies overcome the challenge of static speculators. The technique provides a machine learning capability for inference optimization that can help deliver up to 400% faster inference performance than the base level of performance available in existing inference technologies such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds slow down, even when specialized speculators are present.

The company, founded in 2023, focuses on optimizing inference on its enterprise AI platform. Earlier this year, it raised $305 million as customer adoption and demand grew.

“Companies we work with generally see, as they scale, a shifting workload, and then they don’t see as much acceleration of speculative execution as they did before,” Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. “These speculators generally don’t work well when their workload starts to shift.”

The workload drift problem no one talks about

Most speculators in production today are ‘static’ models. They are trained once on a fixed dataset representing the expected workload, and then deployed without any possibility of adjustment. Companies like Meta and Mistral ship pre-trained speculators alongside their main models. Inference platforms like vLLM use these static speculators to increase throughput without changing output quality.

But there’s a catch. As a company’s AI use evolves, the accuracy of the static speculator drops.

“If you’re a company that produces coding agents, and most of your developers have been writing in Python, then suddenly some of them switch to writing Rust or C, then you see the speed start to drop,” Dao explains. “The speculator has a mismatch between what it was trained for and what the actual workload is.”


This workload drift represents a hidden tax on scaling AI. Companies either accept degraded performance or invest in retraining their speculators, and retraining captures only a snapshot of the workload that quickly goes stale.

How adaptive speculators work: A two-model approach

ATLAS uses a dual-speculator architecture that combines stability with customization:

The static speculator – A heavyweight model trained on broad data ensures consistent baseline performance. It serves as a ‘speed floor’.

The adaptive speculator – A lightweight model continuously learns from live traffic. It specializes in emerging domains and usage patterns.

The confidence-aware controller – An orchestration layer dynamically chooses which speculator to use. It adjusts the speculative lookahead based on confidence scores.
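The routing logic described above might look roughly like the following sketch. The threshold, lookahead bounds, and linear scaling rule are all assumptions for illustration, not details Together AI has published.

```python
from dataclasses import dataclass

@dataclass
class ControllerConfig:
    confidence_threshold: float = 0.8  # assumed switch-over point
    min_lookahead: int = 2
    max_lookahead: int = 8

def choose_speculator(adaptive_confidence: float, cfg: ControllerConfig):
    """Pick a speculator and a lookahead length from a confidence score.

    Returns (which_speculator, lookahead). Below the threshold, fall
    back to the static speculator with a short, safe lookahead (the
    'speed floor'); above it, use the adaptive speculator and widen the
    lookahead as confidence approaches 1.0.
    """
    if adaptive_confidence < cfg.confidence_threshold:
        return "static", cfg.min_lookahead
    span = cfg.max_lookahead - cfg.min_lookahead
    frac = (adaptive_confidence - cfg.confidence_threshold) / (
        1.0 - cfg.confidence_threshold
    )
    return "adaptive", cfg.min_lookahead + round(frac * span)
```

The key design point is the guaranteed floor: even a brand-new deployment with an untrained adaptive speculator never runs slower than the static baseline.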

“Before the adaptive speculator learns anything, we still have the static speculator that provides the speed boost at the beginning,” Ben Athiwaratkun, staff AI scientist at Together AI, explained to VentureBeat. “Once the adaptive speculator becomes more confident, the speed grows over time.”

The technical innovation lies in balancing the acceptance rate (how often the target model agrees with the drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller leans more on the lightweight speculator and widens the lookahead, compounding the performance gains.
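The trade-off has a standard closed form in the speculative decoding literature: with a per-token acceptance rate alpha and a lookahead of k drafted tokens, the expected number of tokens emitted per target-model pass is (1 - alpha^(k+1)) / (1 - alpha). A quick helper shows why a more confident (higher-alpha) speculator justifies a longer lookahead:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens accepted per target-model pass.

    alpha: per-token acceptance rate of the draft model.
    k: speculative lookahead (number of drafted tokens).
    Standard result: (1 - alpha**(k+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)  # limit as alpha approaches 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

At alpha = 0.5 a lookahead of 3 yields under 2 tokens per pass, while at alpha = 0.9 the same lookahead yields over 3, which is exactly why the controller widens the lookahead as the adaptive speculator's accuracy improves.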

Users do not need to tune any parameters. “On the user side, users don’t have to turn any knobs,” Dao said. “On our end, we’ve turned these knobs to a configuration that provides good speed gains.”

Performance that rivals custom silicon

Together AI’s testing shows that ATLAS reaches 500 tokens per second on DeepSeek-V3.1 when fully adapted. Even more impressive, those numbers on Nvidia B200 GPUs match or exceed specialized inference chips such as Groq’s custom hardware.

“The software and algorithmic improvements can bridge the gap with truly specialized hardware,” Dao said. “We saw 500 tokens per second on these huge models that are even faster than some of the custom chips.”

The 400% speedup the company claims for inference represents the cumulative effect of Together’s Turbo optimization suite. FP4 quantization provides an 80% speedup over the FP8 baseline. The static Turbo Speculator adds another 80-100% gain. The adaptive system comes on top. Each optimization compounds the benefits of the others.
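Because the gains multiply rather than add, the arithmetic works out roughly as follows. This is an interpretation of the article's numbers, not official math from Together AI:

```python
# Claimed per-layer gains, expressed as multipliers:
fp4_gain = 1.80          # FP4 quantization: 80% over the FP8 baseline
static_spec_gain = 1.90  # static Turbo Speculator: midpoint of 80-100%

# Compounding the two gives the pre-adaptive speedup:
combined = fp4_gain * static_spec_gain
print(round(combined, 2))  # ~3.42x before the adaptive speculator

# The adaptive layer then only needs a modest extra gain to hit ~4x:
print(round(4.0 / combined, 2))  # ~1.17x on top
```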


Compared to standard inference engines such as vLLM or Nvidia’s TensorRT-LLM, the improvement is significant. Together AI measures against whichever of the two is the stronger baseline for each workload before applying speculative optimizations.

The trade-off between memory and processing power explained

The performance gain comes from exploiting a fundamental inefficiency in modern inference: wasted computing power.

Dao explained that during inference, a large portion of the computing power is typically not fully utilized.

“During inference, which is actually the dominant workload today, you’re mostly using the memory subsystem,” he said.

Speculative decoding trades idle computing power for reduced memory access. When a model generates one token at a time, it is memory bound. The GPU remains idle while waiting for memory. But when the speculator proposes five tokens and the target model verifies them simultaneously, computing usage increases while memory access remains approximately constant.

“The total amount of computing power to generate five tokens is the same, but you only had to access the memory once, instead of five times,” Dao said.
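Dao's point can be made concrete with a toy model: at a batch of one, each decoding step must stream the full model weights through memory, so the number of weight passes, not the arithmetic, sets the speed. Verifying a batch of drafted tokens amortizes that streaming (names and numbers below are illustrative only):

```python
import math

def weight_passes(tokens: int, tokens_per_pass: int) -> int:
    """Full weight-streaming passes needed to emit `tokens` tokens.

    One-at-a-time decoding streams the model weights once per token;
    speculative verification streams them once per accepted batch of
    `tokens_per_pass` tokens. Total compute is roughly unchanged, but
    memory traffic (the actual bottleneck) shrinks proportionally.
    """
    return math.ceil(tokens / tokens_per_pass)

# 100 tokens, plain decoding vs. 5-token speculative batches:
print(weight_passes(100, 1))  # 100 weight-streaming passes
print(weight_passes(100, 5))  # 20 passes for the same output
```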

Think of it as intelligent caching for AI

For infrastructure teams familiar with traditional database optimization, adaptive speculators function like an intelligent caching layer, but with a crucial difference.

Traditional caching systems like Redis or memcached require exact matches. You store and retrieve the exact same query result when that specific query is run again. Adaptive speculators work differently.

“You can think of it as an intelligent way of caching, not exactly storing, but figuring out some patterns that you see,” Dao explains. “Generally speaking, we see that you’re working with similar code, or with similar, you know, controlling the computing power in a similar way. We can then predict what the big model is going to say. We’re just getting better and better at predicting that.”

Instead of storing exact answers, the system learns patterns in the way the model generates tokens. It recognizes that editing Python files in a specific codebase makes certain token sequences more likely. The speculator adapts to these patterns and improves its predictions over time without requiring identical input.
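The contrast with exact-match caching can be illustrated with a deliberately tiny pattern learner. Real adaptive speculators are small neural models, not n-gram tables; this sketch only shows the behavioral difference from a Redis-style cache, which would return nothing for any prefix it has not seen verbatim:

```python
from collections import Counter, defaultdict

class NGramSpeculator:
    """Toy pattern learner: predicts the token that usually follows a
    short context, improving as more traffic is observed. Illustrative
    only; not Together AI's adaptive speculator."""

    def __init__(self, n: int = 2):
        self.n = n
        self.counts = defaultdict(Counter)

    def observe(self, tokens):
        # Learn continuation statistics from live traffic.
        for i in range(len(tokens) - self.n):
            ctx = tuple(tokens[i:i + self.n])
            self.counts[ctx][tokens[i + self.n]] += 1

    def predict(self, tokens):
        # Unlike an exact-match cache, any prefix ending in a known
        # context gets a prediction, even if the full input is new.
        ctx = tuple(tokens[-self.n:])
        if ctx not in self.counts:
            return None  # fall back to the static speculator
        return self.counts[ctx].most_common(1)[0][0]
```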


Use cases: RL training and evolving workloads

Two business scenarios especially benefit from adaptive speculators:

Reinforcement learning training: Static speculators quickly get out of step as policies evolve during training. ATLAS continuously adapts to the changing policy distribution.

Evolving workload: As companies discover new AI use cases, the composition of the workload is changing. “Maybe they started using AI for chatbots, but then they realized it can write code, so they switched to code,” Dao says. “Or they realize that these AIs can actually call tools and control computers and do accounting and things like that.”

During a vibe coding session, the adaptive system can specialize for the specific codebase being edited, including files never seen during training. This further increases the acceptance rate and decoding speed.

What it means for enterprises and the inference ecosystem

ATLAS is now available at Together AI’s dedicated endpoints as part of the platform at no additional cost. The company’s more than 800,000 developers (up from 450,000 in February) have access to the optimization.

But the broader implications extend beyond one supplier’s product. The shift from static to adaptive optimization represents a fundamental rethinking of how inference platforms should work. As companies deploy AI across multiple domains, the industry will need to move from one-off trained models to systems that continuously learn and improve.

Together AI has historically released some of its research techniques as open source and collaborated with projects like vLLM. While the fully integrated ATLAS system is proprietary, some of the underlying techniques can ultimately impact the broader inference ecosystem.

For companies looking to lead in AI, the message is clear: adaptive algorithms on commodity hardware can match custom silicon at a fraction of the cost. As this approach continues to evolve across the industry, software optimization will increasingly outpace specialized hardware.
