CL AI Sep 2, 2025

DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu

Microsoft

arXiv:2509.04499v1, 8 citations, h-index: 23

Originality: Incremental advance

AI Analysis

This work addresses the trustworthiness of AI-driven research tools for users who rely on synthesized information, though it is incremental in that it builds on prior community-identified failure cases.

The paper tackles unreliable source attribution and overconfidence in generative search engines and deep research LLM agents. It introduces DeepTRACE, an audit framework, and finds that these systems often produce one-sided, highly confident responses, with citation accuracy of 40-80% and large fractions of unsupported statements.

Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40-80% across systems.
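To make the matrix-based audit more concrete, here is a minimal Python sketch of how metrics such as citation accuracy could be derived from a citation matrix and a factual-support matrix. The abstract does not give the exact metric definitions, so everything here is an assumption for illustration: the function names, the binary statements-by-sources matrix layout, and the three formulas are hypothetical and may differ from the paper's own definitions.

```python
from typing import List

# Assumed layout: rows are answer statements, columns are listed sources.
# cited[i][j] == 1     -> statement i cites source j
# supported[i][j] == 1 -> source j factually supports statement i
Matrix = List[List[int]]


def citation_accuracy(cited: Matrix, supported: Matrix) -> float:
    """Fraction of citations whose source supports the citing statement (assumed definition)."""
    total = 0
    correct = 0
    for c_row, s_row in zip(cited, supported):
        for c, s in zip(c_row, s_row):
            if c:
                total += 1
                if s:
                    correct += 1
    return correct / total if total else 0.0


def citation_thoroughness(cited: Matrix, supported: Matrix) -> float:
    """Fraction of supporting (statement, source) pairs that are actually cited (assumed definition)."""
    total = 0
    covered = 0
    for c_row, s_row in zip(cited, supported):
        for c, s in zip(c_row, s_row):
            if s:
                total += 1
                if c:
                    covered += 1
    return covered / total if total else 0.0


def unsupported_fraction(supported: Matrix) -> float:
    """Fraction of statements that no listed source supports (assumed definition)."""
    if not supported:
        return 0.0
    unsupported = sum(1 for row in supported if not any(row))
    return unsupported / len(supported)


# Toy example: 3 statements, 2 sources (made-up values, not data from the paper).
cited = [[1, 0], [0, 1], [1, 1]]
supported = [[1, 0], [0, 0], [0, 1]]

accuracy = citation_accuracy(cited, supported)
thoroughness = citation_thoroughness(cited, supported)
unsupported = unsupported_fraction(supported)
```

In this toy example, two of the four citations point to a supporting source, so the assumed citation accuracy is 0.5. The second statement has no supporting source at all, which is the kind of case the abstract describes as a statement unsupported by the system's own listed sources.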
