AI · Jun 1

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

arXiv:2606.02060
AI Analysis

For developers and evaluators of deep-research agents, this work provides a process-level method to localize errors in long trajectories, addressing the limitation of final-answer-only evaluation.

The paper introduces TELBench, a benchmark for span-level error localization in deep-research agent trajectories, and proposes DRIFT, a claim-centric auditing framework that improves first-error accuracy by up to 30 percentage points.

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
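The claim-centric idea in the abstract (track claims, check their support in trajectory evidence, flag spans where unsupported claims reach the answer path) can be illustrated with a toy sketch. This is not the authors' DRIFT implementation: the `Span` fields, the `audit` and `first_error` functions, and the exact-string matching of claims against evidence are all assumptions made for illustration. A real auditor would presumably use a model to judge support and conflict rather than set membership.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One semantic span of a trajectory (hypothetical schema, not from the paper)."""
    index: int
    kind: str                                         # e.g. "search", "reasoning", "answer"
    evidence: set[str] = field(default_factory=set)   # facts this span surfaces
    claims: set[str] = field(default_factory=set)     # facts this span asserts
    used_in_answer: bool = False                      # do its claims feed the final answer?


def audit(trajectory: list[Span]) -> list[int]:
    """Flag spans that assert claims with no supporting evidence so far
    and whose claims lie on the answer path."""
    seen_evidence: set[str] = set()
    flagged: list[int] = []
    for span in trajectory:
        seen_evidence |= span.evidence
        unsupported = span.claims - seen_evidence
        # Unsupported claims off the answer path are treated as harmless
        # (tentative hypotheses, noise) and are not flagged.
        if unsupported and span.used_in_answer:
            flagged.append(span.index)
    return flagged


def first_error(flagged: list[int]) -> Optional[int]:
    """Earliest flagged span index, or None if nothing was flagged."""
    return min(flagged) if flagged else None


trajectory = [
    Span(0, "search", evidence={"paper published 2019"}),
    Span(1, "reasoning", claims={"author is Smith"}),                             # tentative, unused
    Span(2, "reasoning", claims={"paper published 2019"}, used_in_answer=True),   # supported
    Span(3, "reasoning", claims={"venue is NeurIPS"}, used_in_answer=True),       # unsupported
    Span(4, "answer", claims={"venue is NeurIPS"}, used_in_answer=True),          # inherits the error
]

flagged = audit(trajectory)
earliest = first_error(flagged)
```

In this toy trajectory, span 1 makes an unsupported claim but is not flagged because it never reaches the answer, which mirrors the abstract's distinction between harmful error spans and tentative hypotheses or harmless noise.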

