CL, AI, LG | Jan 20

APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia

arXiv:2601.14242v1 | 6 citations | h-index: 28 | Has Code

Originality: Synthesis-oriented

AI Analysis

This paper provides a new benchmark for assessing AI agents in professional domains such as investment banking and consulting. The contribution is incremental, since it builds on existing agent evaluation methods.

The paper tackles the problem of evaluating AI agents on long-horizon, cross-application tasks by introducing the APEX-Agents benchmark, which tests agents in realistic work environments. Gemini 3 Flash achieves the highest score, 24.0%.

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open-source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
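The leaderboard metric named above, Pass@1, is commonly read as the share of tasks an agent completes successfully when it gets exactly one attempt per task. The sketch below illustrates that reading only. The function name `pass_at_1`, the task IDs, and the pass/fail outcomes are hypothetical, and the abstract does not say how a single attempt is judged against the rubrics or how scores are aggregated across runs, so this is not the benchmark's actual scoring code.

```python
from typing import Mapping


def pass_at_1(results: Mapping[str, bool]) -> float:
    """Return the fraction of tasks whose single attempt passed.

    `results` maps a task ID to True (passed) or False (failed).
    Assumption: each task was attempted exactly once.
    """
    if not results:
        raise ValueError("no task results to score")
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return sum(results.values()) / len(results)


# Hypothetical outcomes for five tasks (not data from the paper).
example_results = {
    "task-001": True,
    "task-002": False,
    "task-003": False,
    "task-004": True,
    "task-005": False,
}

score = pass_at_1(example_results)
print(f"Pass@1 = {score:.1%}")  # prints "Pass@1 = 40.0%"
```

Under this reading, a score of 24.0% means roughly one task in four is completed on the first and only attempt.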
