AI CL May 14

Holistic Evaluation and Failure Diagnosis of AI Agents

Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack

arXiv:2605.1486558.1

AI Analysis

For researchers and developers of AI agents, this framework provides a scalable and interpretable method for diagnosing failures in complex multi-step processes. On the TRAIL benchmark it outperforms the strongest prior evaluators on every reported metric.

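The authors present a holistic evaluation framework for AI agents that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. On the TRAIL benchmark (GAIA and SWE-Bench), it achieves state-of-the-art results, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Because the same frontier model localizes errors several times more accurately inside the framework than as a monolithic judge over the full trace, the authors argue that evaluation methodology, not model capability, is the bottleneck.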

AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.
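The per-span decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Span` and `Verdict` types, the `evaluate_trace` and `diagnose` helpers, the `tool_error` category name, and the keyword-based `toy_judge` are all hypothetical stand-ins chosen for illustration. In a real system the judge would presumably be a model call, and the agent-level diagnosis would be richer than the simple grouping shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Span:
    """One step of an agent trace (hypothetical schema)."""
    span_id: str
    kind: str   # e.g. "llm" or "tool"
    text: str


@dataclass
class Verdict:
    """Outcome of judging a single span in isolation."""
    span_id: str
    category: Optional[str]  # None means no error was found
    rationale: str


# A judge sees exactly one span at a time, never the whole trace.
Judge = Callable[[Span], Verdict]


def evaluate_trace(spans: List[Span], judge: Judge) -> List[Verdict]:
    """Judge every span independently, so work grows linearly with trace length."""
    return [judge(span) for span in spans]


def diagnose(verdicts: List[Verdict]) -> Dict[str, List[str]]:
    """Group failing span ids by error category (a simplified agent-level view)."""
    report: Dict[str, List[str]] = {}
    for verdict in verdicts:
        if verdict.category is not None:
            report.setdefault(verdict.category, []).append(verdict.span_id)
    return report


def toy_judge(span: Span) -> Verdict:
    """Assumption: a keyword rule standing in for a model-based judge."""
    if "Traceback" in span.text:
        return Verdict(span.span_id, "tool_error", "tool output contains a traceback")
    return Verdict(span.span_id, None, "no issue found")


trace = [
    Span("s1", "llm", "Plan: search the docs"),
    Span("s2", "tool", "Traceback (most recent call last): ..."),
    Span("s3", "llm", "Final answer: 42"),
]
verdicts = evaluate_trace(trace, toy_judge)
report = diagnose(verdicts)
```

Because each verdict carries its own span id and rationale, an error's category and its location are reported together, which is the pairing that the joint localization-categorization metric rewards.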
