TRAIL: Trace Reasoning and Agentic Issue Localization
This work addresses the need for scalable evaluation methods for agentic systems, a need that matters to developers and researchers working with AI agents. The contribution is incremental in that it builds on existing benchmarks.
The paper tackles the problem of evaluating the complex traces generated by agentic workflows, a task for which current manual methods do not scale. It introduces TRAIL, a dataset of 148 human-annotated traces, and shows that modern LLMs perform poorly at trace debugging: the best-performing model, Gemini-2.5-pro, scores only 11%.
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces, an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single-agent and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long-context LLMs perform poorly at trace debugging, with the best-performing model, Gemini-2.5-pro, scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.