Silicon Valley bets big on ‘environments’ to train AI agents

For years, Big Tech CEOs have recommended visions of AI agents who can use software applications autonomously to complete tasks for people. But take today’s current AI agents with you for a spider, whether it is the Chatgpt agent from OpenAi or the comet of Pertlexity, and you will quickly realize how limited the technology is. AI agents can make a new series of techniques that still discover the industry.
One of those techniques is carefully simulating workplaces where agents can be trained on Multi-Step-tasks-known as RL environments (strengthening the learning theory). Just like labeled datasets, the last wave of AI, RL environments are starting to look like a crucial element in the development of agents.
AI researchers, founders and investors tell WAN that leading AI Laboratories now demand more RL environments, and there is no shortage of startups that hope to deliver.
“All major AI laboratories build RL environments in-house,” said Jennifer Li, general partner at Andreessen Horowitz, in an interview with WAN. “But as you can imagine, making these datasets is very complex, so AI Labs also look at third -party suppliers who can create environments and evaluations of high quality. Everyone looks at this space.”
The urge for RL environments has made a new class well-funded startups, such as mechanize work and prime intellect, which are aimed at leading the space. In the meantime, large data-labeling companies such as Mercor and Surge say that they invest more in RL environments to keep pace with the shifts of the static data sets to interactive simulations. The most important laboratories also consider investing heavily: according to the information, Anthropic leaders have discussed more than expenditure $ 1 billion on RL environments The following year.
The hope for investors and founders is that one of these startups comes to the fore as the “scale AI for environments”, referring to the Powerhouse of $ 29 billion data tennis that the Chatbot era has driven.
The question is whether RL environments will really suppress the boundary of AI preface.
WAN event
San Francisco
|
27-29 October 2025
What is a RL environment?
In essence, RL environments are training grounds that simulate what an AI agent would do in a real software application. A founder described the construction of them in Recent interview “Like making a very boring video game.”
For example, an environment can simulate a Chrome -Browser and an AI agent tasks with buying a few socks on Amazon. The agent is assessed for his performance and has sent a reward signal when it succeeds (in this case buy a worthy pair of socks).
Although such a task sounds relatively simple, there are many places where an AI agent can be stumbled. It can be lost by navigating through the drop -down menus’ drop -down menus or buying too many socks. And because developers cannot predict exactly what the wrong turn an agent will take, the environment itself must be robust enough to capture unexpected behavior and yet to give useful feedback. That makes building environments much more complex than a static data set.
Some environments are quite robust, so that AI agents can use tools, gain access to the internet or use different software applications to complete a certain task. Others are more narrow, aimed at helping an agent to learn specific tasks in Enterprise Software applications.
Although RL environments are now hot in Silicon Valley, there is much precedent for using this technique. One of OpenAi’s first projects in 2016 was building “RL Gyms“They were quite similar to the modern view of environments. In the same year, Google trained DeepMind Alfo – An AI system that could beat a world champion at the board game, Go – using RL techniques within a simulated environment.
What is unique about today’s environments is that researchers try to build computer-usual AI agents with large transformation models. In contrast to Alphago, a specialized AI system that worked in closed environments, today’s AI agents have been trained to have more general options. AI researchers nowadays have a stronger starting point, but also a complicated goal that can go wrong.
A busy field
AI databeals that try Scale AI, Surge and Mercor try to meet the moment and build RL environments. These companies have more resources than many startups in space, as well as deep relationships with AI Labs.
Surge CEO Edwin Chen tells WAN that he recently saw a “significant increase” of the demand for RL environments within AI Labs. Surge – who is reportedly generated $ 1.2 billion in income Last year of working with AI Laboratories such as OpenAI, Google, Anthropic and Meta -recently spun a new internal organization specifically charged with building RL environments, he said.
Close behind Surge is Mercor, a startup with a value of $ 10 billion, who has also worked with OpenAi, Meta and Anthropic. Mercor Pitching Investors In the RL environments of the company for domain -specific tasks such as coding, health care and law, according to marketing material, seen by WAN.
Mercor CEO Brendan Foody told WAN in an interview that “few understand the chance of RL environments really.”
Scale AI dominated the space for data labeling, but has lost the land since Meta has invested $ 14 billion and hired his CEO. Since then, Google and OpenAi scale have dropped AI as a customer, and the startup is even confronted with competition for data labeling work within Meta. Yet scale tries to meet the moment and build environments.
“This is just the nature of the company [Scale AI] Is inside, “said Chetan Rane, scale AI’s head of product for agents and RL environments.” Scale has proven its ability to adjust quickly. We did this in the early days of autonomous vehicles, our first business unit. When Chatgpt came out, AI scale adapted to that. And now, again, we adapt to new border spaces such as agents and environments. “
Some newer players focus exclusively on environments from the start. Among them is Mechanize Work, a startup set up about six months ago with the daring goal of “automating all jobs”. Co-founder Matthew Barnett, however, tells WAN that his company starts with RL environments for AI coding agents.
Mechanize work is intended to provide AI laboratories with a small number of robust RL environments, says Barnett, instead of larger data companies that create a wide range of simple RL environments. Until now, the Startup Software offers $ 500,000 salaries To build RL environments – much higher than a contracting party per hour could work on scale AI or Surge.
Mechanize work has already worked with anthropic on RL environments, two sources that are familiar with the issue told WAN. Mechanize work and anthropic refused to comment on the partnership.
Other startups bet that RL environments will influence AI Labs. Prime Intellect – A startup supported by AI researcher Andrej Karpathy, Founders Fund and Menlo Ventures – focuses on smaller developers with his RL environments.
Last month, Prime Intellect launched one RL -Hub environments, The aim is to be a “hugging face for RL environments.” The idea is to give open-source developers access to the same sources that have large AI laboratories and that developers sell access to computational sources in the process.
Training generally capable of RL environments can be more computational more expensive than previous AI training techniques, according to Prime Intellect Researcher Will Brown. In addition to startups that build RL environments, there is also a chance for GPU providers who can feed the process.
“RL environments will be too big for a company to dominate,” said Brown in an interview. “A part of what we do is just trying to build a good open-source infrastructure around it. The service we sell has been calculated, so it is a handy on-radiation to use GPUs, but we are thinking more about this in the long term.”
Will it be scaled?
The open demand for RL environments is whether the technology will scale like earlier AI training methods.
In the past year, reinforcement education has driven some of the greatest jumps in AI, including models such as OpenAi’s O1 and Claude Opus 4 from Anthropic. These are particularly important breakthroughs because the methods that were previously used to improve AI models now show decreasing returns.
Envisors are part of the larger bet of AI Labs on RL, which, according to many, will continue to stimulate progress as they add more data and computational sources to the process. Some of the OpenAI researchers behind O1 previously told WAN that the company was originally invested in AI-reasoning models that were created by investments in RL and test-time computer-because they thought it would be neatly scaled.
The best way to scale RL remains unclear, but environments seem to be a promising competition. Instead of just rewarding chatbots for text reactions, they let agents in simulations work with tools and computers at their disposal. That is much more resource-intensive, but possibly more rewarding.
Some are skeptical that all these RL environments will come true. Ross Taylor, a former AI research leader with meta who co-founder of general reasoning, says WAN that RL environments are susceptible to rewarding hacking. This is a process in which AI models cheat to get a reward without really doing the task.
“I think people underestimate how difficult it is to scale environments,” said Taylor. “Even the best public available [RL environments] Usually does not work without a serious change. “
OpenAi’s head Engineering for his API company, Sherwin Wu, said in a Recent podcast That he was “short” on the startups of the RL environment. Wu noted that it is a very competitive space, but also that AI research evolves so quickly that it is difficult to serve AI Labs well.
Karpathy, an investor in Prime Intellect who has called RL environments a possible breakthrough, also has caution for the RL room more broadly expressed. In one Post on XHe expressed his concern about how much more AI preliminary output can be pressed from RL.
“I am a bullish about environments and agent interactions, but I am specifically bearish on learning strengthening,” said Karpathy.




