Are AI agents ready for the workplace? A new benchmark raises doubts.

It’s been almost two years since Microsoft CEO Satya Nadella predicted that AI would replace knowledge work – the white-collar jobs of lawyers, investment bankers, librarians, accountants, IT professionals and others.

But despite tremendous progress in the underlying models, change in knowledge work has been slow to arrive. Models have mastered deep research and agentic planning, yet for whatever reason most white-collar work has remained relatively untouched.

It’s one of the biggest mysteries in AI – and thanks to new research from training data giant Mercor, we’re finally getting answers.

The new research looks at how leading AI models hold up when performing actual white-collar work tasks, drawn from consulting, investment banking and law. The result is a new benchmark called Apex Agents – and so far, every AI lab has received a failing grade. Faced with questions from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the models returned an incorrect answer or no answer at all.

According to researcher Brendan Foody, who worked on the paper, the models’ biggest stumbling block was discovering information across multiple domains – something that is integral to most knowledge work done by humans.

“One of the big changes in this benchmark is that we’ve built out the entire environment, modeled after real professional services,” Foody told TechCrunch. “The way we do our work isn’t one person giving us all the context in one place. In real life, you’re working through Slack and Google Drive and all these other tools.” For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.
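
Mercor hasn’t released the code behind its evaluation environment, and the sketch below isn’t it: it’s a toy Python illustration of the setup Foody describes, with made-up MockTool and Workspace classes and canned snippets. The point is the shape of the task: the answer isn’t in the prompt, so an agent has to go find the relevant facts scattered across several simulated tools.

```python
# Illustrative toy only; the class names, tools and snippets are invented here
# and are not part of Mercor's actual Apex Agents harness.
from dataclasses import dataclass, field


@dataclass
class MockTool:
    """Stand-in for one workplace system (e.g. chat or file storage)."""
    name: str
    documents: dict = field(default_factory=dict)

    def search(self, query: str) -> list[str]:
        # Naive keyword match; a real environment would expose richer APIs.
        return [text for text in self.documents.values() if query.lower() in text.lower()]


@dataclass
class Workspace:
    """The agent sees a question plus several tools, not one tidy prompt."""
    question: str
    tools: list

    def gather(self, query: str) -> list[str]:
        # Collect matching snippets from every tool the scenario exposes.
        hits: list[str] = []
        for tool in self.tools:
            hits.extend(tool.search(query))
        return hits


if __name__ == "__main__":
    ws = Workspace(
        question="Can the log exports be treated as compliant with Article 49?",
        tools=[
            MockTool("chat", {"msg-1": "Engineering exported EU event logs to the US analytics vendor."}),
            MockTool("drive", {"policy": "Article 49 transfers require a documented necessity assessment."}),
        ],
    )
    # A model that reads only the question misses both facts; it has to search.
    print(ws.gather("article 49"))
    print(ws.gather("exported"))
```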

The scenarios all come from real professionals in Mercor’s expert marketplace, who both supply the questions and set the standard for a successful answer. Looking through the questions, which are posted publicly on Hugging Face, gives an idea of how complex the tasks can be.

One question in the ‘Law’ section asks:

During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the US analytics vendor… Under Northstar’s own policies, can it reasonably treat the one or two log exports as compliant with Article 49?

The right answer is yes, but getting there will require an in-depth assessment of the company’s own policies and relevant EU privacy laws.

That answer might surprise even a well-informed reader, but the researchers’ aim was to model the work professionals in the field actually do. If an LLM could reliably answer these questions, it could effectively replace many of the attorneys working today. “I think this is probably the most important topic in economics,” Foody told TechCrunch. “The benchmark largely reflects the real work these people do.”

OpenAI has also tried to measure professional skills with its GDPval benchmark – but the Apex Agents test differs in important ways. While GDPval tests general knowledge across a wide range of professions, the Apex Agents benchmark measures a system’s ability to perform sustained tasks in a limited number of high-value professions. The result is a harder test for models, but one more closely tied to whether these tasks can actually be automated.

While none of the models appeared ready to take over the role of investment banker, some were clearly closer. Gemini 3 Flash performed the best of the group with a one-shot accuracy of 24%, closely followed by GPT-5.2 at 23%. Below that, Opus 4.5, Gemini 3 Pro and GPT-5 all scored around 18%.

Although the early results are underwhelming, the AI field has a history of blowing through challenging benchmarks. Now that the Apex test is public, it’s an open challenge for AI labs that think they can do better – something Foody fully expects to happen in the coming months.

“It’s improving very quickly,” he told TechCrunch. “Right now it’s fair to say it’s an intern who gets it right a quarter of the time, but last year it was the intern who got it right five to 10 percent of the time. That kind of improvement, year over year, can make an impact so quickly.”
