Are AI agents ready for the workplace? A new benchmark raises doubts.

It’s been almost two years since Microsoft CEO Satya Nadella predicted that AI would replace knowledge work – the white-collar jobs of lawyers, investment bankers, librarians, accountants, IT professionals and others.

But despite tremendous progress in the underlying models, change in knowledge work has been slow to arrive. Models have mastered deep research and agentic planning, yet for whatever reason most white-collar work has remained relatively untouched.

It’s one of the biggest mysteries in AI – and thanks to new research from training data giant Mercor, we’re finally getting answers.

The new research looks at how leading AI models hold up when performing actual white-collar work tasks, drawn from consulting, investment banking and law. The result is a new benchmark called Apex Agents – and so far, every AI lab has received a failing grade. Faced with questions from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the models returned an incorrect answer or no answer at all.

According to researcher Brendan Foody, who worked on the paper, the models’ biggest stumbling block was discovering information across multiple domains – something that is integral to most knowledge work done by humans.

“One of the big changes in this benchmark is that we’ve built out the entire environment, modeled after real professional services,” Foody told TechCrunch. “The way we do our work isn’t one person giving us all the context in one place. In real life, you’re working through Slack and Google Drive and all these other tools.” For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.
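
Mercor hasn’t released the code behind its evaluation environment, and the sketch below isn’t it: it’s a toy Python illustration of the setup Foody describes, with made-up MockTool and Workspace classes and canned snippets. The point is the shape of the task: the answer isn’t in the prompt, so an agent has to go find the relevant facts scattered across several simulated tools.

```python
# Illustrative toy only; the class names, tools and snippets are invented here
# and are not part of Mercor's actual Apex Agents harness.
from dataclasses import dataclass, field


@dataclass
class MockTool:
    """Stand-in for one workplace system (e.g. chat or file storage)."""
    name: str
    documents: dict = field(default_factory=dict)

    def search(self, query: str) -> list[str]:
        # Naive keyword match; a real environment would expose richer APIs.
        return [text for text in self.documents.values() if query.lower() in text.lower()]


@dataclass
class Workspace:
    """The agent sees a question plus several tools, not one tidy prompt."""
    question: str
    tools: list

    def gather(self, query: str) -> list[str]:
        # Collect matching snippets from every tool the scenario exposes.
        hits: list[str] = []
        for tool in self.tools:
            hits.extend(tool.search(query))
        return hits


if __name__ == "__main__":
    ws = Workspace(
        question="Can the log exports be treated as compliant with Article 49?",
        tools=[
            MockTool("chat", {"msg-1": "Engineering exported EU event logs to the US analytics vendor."}),
            MockTool("drive", {"policy": "Article 49 transfers require a documented necessity assessment."}),
        ],
    )
    # A model that reads only the question misses both facts; it has to search.
    print(ws.gather("article 49"))
    print(ws.gather("exported"))
```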

The scenarios all come from real professionals in Mercor’s expert marketplace, who both supply the questions and set the standard for a successful answer. Looking through the questions, which are posted publicly on Hugging Face, gives an idea of how complex the tasks can be.

One question in the ‘Law’ section asks:

During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the US analytics vendor… Under Northstar’s own policies, can it reasonably treat the one or two log exports as compliant with Article 49?

The right answer is yes, but getting there will require an in-depth assessment of the company’s own policies and relevant EU privacy laws.

That answer might surprise even a well-informed reader, but the researchers’ aim was to model the work professionals in the field actually do. If an LLM could reliably answer these questions, it could effectively replace many of the attorneys working today. “I think this is probably the most important topic in economics,” Foody told TechCrunch. “The benchmark largely reflects the real work these people do.”

OpenAI has also tried to measure professional skills with its GDPval benchmark – but the Apex Agents test differs in important ways. While GDPval tests general knowledge across a wide range of professions, the Apex Agents benchmark measures a system’s ability to perform sustained tasks in a limited number of high-value professions. The result is a harder test for models, but one more closely tied to whether these tasks can actually be automated.

While none of the models appeared ready to take over the role of investment banker, some were clearly closer. Gemini 3 Flash performed the best of the group with a one-shot accuracy of 24%, closely followed by GPT-5.2 at 23%. Below that, Opus 4.5, Gemini 3 Pro and GPT-5 all scored around 18%.

Although the early results are underwhelming, the AI field has a history of blowing through challenging benchmarks. Now that the Apex test is public, it’s an open challenge for AI labs that think they can do better – something Foody fully expects to happen in the coming months.

“It’s improving very quickly,” he told TechCrunch. “Right now it’s fair to say it’s an intern who gets it right a quarter of the time, but last year it was the intern who got it right five to 10 percent of the time. That kind of improvement, year over year, can make an impact so quickly.”
