AI’s capacity crunch: Latency risk, escalating costs, and the coming surge-pricing breakpoint

The latest big headline in AI isn’t model size or multimodality; it is the capacity crisis. At VentureBeat’s latest AI Impact stop in New York, Val Bercovici, Chief AI Officer at WEKA, joined VentureBeat CEO Matt Marshall to discuss what it really takes to scale AI amid rising latency, cloud lock-in, and runaway costs.

These forces, Bercovici argued, are pushing AI toward its own version of surge pricing. Uber famously pioneered surge pricing, exposing real-time market rates to ride sharing for the first time. Now, Bercovici argued, AI is headed for the same economic reckoning, especially in inference, once the focus shifts to profitability.

“We don’t have real market rates today. We have subsidized rates. That was necessary to enable a lot of the innovation, but sooner or later – given the trillions of dollars of capex we’re talking about now, and the finite energy opex – real market rates will appear; maybe next year, certainly in 2027,” he said. “If they do, it will fundamentally change this industry and drive an even deeper, sharper focus on efficiency.”

The economics of the token explosion

“The first rule is that this is an industry where more is more. More tokens equal exponentially more business value,” said Bercovici.

But so far no one has discovered how to make that sustainable. The classic business triad – cost, quality and speed – translates in AI to latency, cost and accuracy (especially in output tokens). And accuracy is non-negotiable. This applies not only to consumer interactions with agents like ChatGPT, but also to high-stakes applications such as drug discovery and business workflows in heavily regulated industries such as financial services and healthcare.


“That’s non-negotiable,” Bercovici said. “You need to have a large number of tokens for high inference accuracy, especially if you add security to the mix, guardrail models and quality models. Then you trade off latency and cost. That’s where you have some flexibility. If you can tolerate high latency, and sometimes for consumer use, then you can have lower costs, with free tiers and low cost-plus tiers.”
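The tradeoff Bercovici describes can be made concrete with a toy pricing model: hold the token budget (and thus accuracy) fixed, and let latency tolerance drive cost. The tier names, prices, and latencies below are invented for illustration; they are not figures from the talk.

```python
# Hypothetical pricing tiers illustrating the latency/cost tradeoff at a
# fixed token budget (accuracy held constant). All numbers are invented.
TIERS = {
    # tier: (price in $ per 1M output tokens, typical latency in seconds)
    "realtime": (15.00, 1),
    "standard": (5.00, 10),
    "batch":    (1.25, 3600),  # high-latency tier, heavily discounted
}

def cost_usd(tier: str, output_tokens: int) -> float:
    """Dollar cost of a workload at a given tier."""
    price_per_million, _latency_s = TIERS[tier]
    return output_tokens / 1_000_000 * price_per_million

# The same 2M-output-token workload priced at each tier:
for tier in TIERS:
    print(tier, round(cost_usd(tier, 2_000_000), 2))
```

Under these made-up numbers, tolerating an hour of latency cuts the bill by more than 10x versus real-time inference, which is the economic room that makes free and low-cost consumer tiers possible.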

However, latency is a critical bottleneck for AI agents. “These agents are no longer operating singly in any sense of the word. You either have a swarm of agents or no agent activity at all,” Bercovici noted.

In a swarm, groups of agents work in parallel toward a larger goal. An orchestrator agent (the smartest model) sits at the center and defines the subtasks and key requirements: architecture choices, cloud versus on-premises execution, performance limits, and security considerations. The swarm then executes all subtasks, effectively spinning up numerous concurrent inference sessions in parallel. Finally, evaluation models assess whether the overall task has been completed successfully.

“These swarms go through what’s called multiple turns, hundreds if not thousands of prompts and responses, until the swarm comes together with an answer,” Bercovici said.

“And if you have compound delay over those thousand turns, it becomes unsustainable. So latency is really, really important. And that means that today you typically have to pay a high price that is subsidized, and that’s what will have to happen over time.”
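A back-of-envelope calculation shows why per-turn latency compounds into an unsustainable total across a long swarm run. The per-turn figures below are illustrative assumptions, not numbers from the talk.

```python
# Back-of-envelope: end-to-end delay of a multi-turn agent run grows
# linearly with turn count, so small per-turn latency compounds fast.
# Per-turn latencies are illustrative assumptions.
def total_latency_s(turns: int, per_turn_ms: float) -> float:
    """Total sequential delay in seconds for `turns` prompt/response turns."""
    return turns * per_turn_ms / 1000.0

print(total_latency_s(1000, 500))   # 1,000 turns at 500 ms each -> 500.0 s
print(total_latency_s(1000, 2000))  # 1,000 turns at 2 s each -> 2000.0 s
```

At 500 ms per turn, a thousand-turn run already takes over eight minutes end to end; at 2 seconds per turn it exceeds half an hour, which is why low-latency (and today subsidized) inference is what keeps swarms viable.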

Reinforcement learning as the new paradigm

Until about May of this year, agents weren’t performing very well, Bercovici explained. Then context windows became large enough, and GPUs available enough, to support agents that could handle advanced tasks like writing reliable software. By some estimates, as much as 90% of software is now generated by coding agents in certain cases. Now that agents have effectively come of age, Bercovici noted, reinforcement learning is the new conversation among data scientists at leading labs such as OpenAI, Anthropic and Google DeepMind, who see it as a crucial path forward in AI innovation.


“The current AI season is reinforcement learning. It combines many elements of training and inference into one unified workflow,” said Bercovici. “It’s the latest and greatest scaling law for this mythical milestone we’re all trying to achieve called AGI – artificial general intelligence,” he added. “What’s fascinating to me is that you have to apply all the best practices of how you train models, plus all the best practices of how you infer models, to be able to iterate on these thousands of reinforcement learning loops and move the entire field forward.”

The road to AI profitability

There is no one-size-fits-all answer when it comes to building an infrastructure foundation that makes AI profitable, Bercovici said, because the field is still emerging. Going completely on-prem may be the right choice for some, especially frontier model builders, while cloud-native or hybrid environments may be a better path for organizations that want to innovate flexibly and responsively. Whichever path they choose initially, organizations will need to adapt their AI infrastructure strategy as their business needs evolve.

“The unit economics are fundamental here,” Bercovici said. “We’re certainly in a boom, or even a bubble in some cases, you might say, because the underlying AI economy is subsidized. But that doesn’t mean that as tokens become more expensive, you won’t use them anymore. You’ll just get very granular in terms of how you use them.”

Leaders should focus less on the pricing of individual tokens and more on the economics at the transaction level, where efficiency and impact become visible, Bercovici concludes.


The central question business and AI leaders should ask, Bercovici said, is: “What is the real cost to my unit economics?”

Viewed through that lens, the way forward is not about doing less with AI, but about doing it smarter and more efficiently at scale.
