CL · AI · LG · Jan 20

APEX-Agents

arXiv:2601.14242v1 · 6 citations · h-index: 28 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper provides a new benchmark for assessing AI agents in professional domains such as investment banking and consulting, though its contribution is incremental because it builds on existing agent evaluation methods.

The paper addresses the evaluation of AI agents on long-horizon, cross-application tasks by introducing the APEX-Agents benchmark, which tests agents in realistic work environments. Gemini 3 Flash (Thinking=High) achieves the highest Pass@1 score, 24.0%.

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open-source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
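
To make the Pass@1 metric in the abstract concrete, here is a minimal sketch of how such a score could be computed from per-task rubric results. It is not the Archipelago implementation: the function name `pass_at_1`, the input layout, and the rule that a task passes only when every rubric criterion is met are all assumptions made for illustration, and the example data is invented.

```python
from typing import Mapping, Sequence


def pass_at_1(results: Mapping[str, Sequence[bool]]) -> float:
    """Return the fraction of tasks whose single attempt passed.

    `results` maps a task id to the rubric outcomes of that task's one
    attempt. ASSUMPTION: a task counts as passed only if every criterion
    is True; the benchmark's actual pass rule may differ.
    """
    if not results:
        raise ValueError("no tasks to score")
    passed = sum(1 for criteria in results.values() if all(criteria))
    return passed / len(results)


# Hypothetical rubric outcomes for four tasks, one attempt each.
example = {
    "task_a": [True, True, True],   # all criteria met: pass
    "task_b": [True, False],        # one criterion missed: fail
    "task_c": [True],               # pass
    "task_d": [False, False],       # fail
}

score = pass_at_1(example)
print(f"Pass@1 = {score:.1%}")
```

With a single attempt per task, Pass@1 is simply the share of tasks that pass, so under this reading a leaderboard score of 24.0% on n=480 would correspond to roughly 115 passing tasks.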

