AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

arXiv:2605.2053080.2

Predicted impact top 35% in AI · last 90 daysOriginality Incremental advance

AI Analysis

For researchers and developers of LLM agents, this work provides a more nuanced evaluation methodology that reveals the extent to which current benchmarks overestimate agent capabilities due to prompt engineering.

AgentAtlas addresses the fragmentation of LLM agent benchmarks by proposing a taxonomy-based evaluation framework that separates prompt supervision from true capability, finding that removing explicit label menus drops trajectory accuracy by 14-40 percentage points across models, with no single model excelling on all metrics.

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents. AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware vs. taxonomy-blind methodology that measures how much of a model's apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit mapping fifteen agent benchmarks against six behavioral axes. To demonstrate the methodology we run a small fixed eight-model set (1,342 generated items, four frontier closed and four open-weight) under both prompt modes. Removing the explicit label menu drops every model's trajectory accuracy by 14-40 pp to a tight 0.54-0.62 floor regardless of family, and no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention. We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.

View on arXiv PDF

Similar