Key Points
- APEX‑Agents benchmark tests AI on real consulting, banking, and legal tasks.
- All evaluated models fail the benchmark, with the best scoring roughly 24% accuracy.
- Multi‑domain reasoning across tools like Slack and Google Drive is a major weakness.
- Researchers liken current AI performance to an intern who gets the right answer about a quarter of the time.
- The benchmark is publicly available, encouraging further research and improvement.
Background
Nearly two years after a major tech CEO predicted that artificial intelligence would replace many knowledge‑work jobs, progress has been slower than expected. While large language models have advanced in research and planning capabilities, their impact on professions such as consulting, investment banking, and law remains limited.
Introducing the APEX‑Agents Benchmark
To evaluate AI readiness for professional tasks, Mercur researchers created a benchmark named APEX‑Agents. The test draws real queries from experts on the company’s marketplace and measures how well AI systems can handle sustained, domain‑specific work. Scenarios are modeled after actual professional environments, requiring navigation across multiple platforms and data sources.
Performance Results
The benchmark results show that all evaluated AI models receive failing grades. Even the best‑performing system, Gemini 3 Flash, achieves only 24% one‑shot accuracy, while GPT‑5.2 scores 23%. Other models hover around 18% accuracy. In most cases, the models either provide incorrect answers or no answer at all, indicating a significant gap between current AI capabilities and the demands of high‑value professional tasks.
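For readers unfamiliar with the metric, here is a minimal sketch of how a one-shot accuracy figure like those above could be tallied. The task records and field names are hypothetical illustrations, not part of the released benchmark; the key point, per the results, is that a wrong answer and no answer at all both count as failures.

```python
# Hypothetical scoring sketch: one-shot accuracy counts a task as solved
# only if the model's single attempt matches the expected answer.
# Incorrect answers and missing answers are both scored as failures.
results = [
    {"expected": "B", "model_answer": "B"},   # correct
    {"expected": "A", "model_answer": "C"},   # incorrect
    {"expected": "D", "model_answer": None},  # no answer returned
    {"expected": "A", "model_answer": "A"},   # correct
]

solved = sum(1 for r in results if r["model_answer"] == r["expected"])
accuracy = solved / len(results)
print(f"One-shot accuracy: {accuracy:.0%}")  # 50% on this toy set
```

On this toy set the score is 50%; the benchmark's reported figures of 18% to 24% imply that most real tasks fall into the incorrect or no-answer buckets.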
Key Challenges Identified
Researchers pinpointed multi‑domain reasoning as the biggest stumbling block. Professionals typically work across tools such as Slack, Google Drive, and other internal systems, and the AI models struggled to retrieve and synthesize information spread across these environments. This limitation hampers the models’ ability to perform tasks that require comprehensive context and cross‑referencing of data.
Implications for the Future of Work
The findings suggest that, for now, AI systems are comparable to interns who get the right answer roughly a quarter of the time. However, researchers note that progress has been rapid, with current performance representing a notable improvement over previous years. The public release of the APEX‑Agents benchmark invites AI labs to develop better solutions, potentially accelerating advancements toward more capable professional assistants.
Source: techcrunch.com