CL · May 7

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah, Corey Feld

arXiv:2605.06635 · 76.41 citations · Has Code

Predicted impact top 80% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For researchers and developers of LLM-based research agents, this work provides a much-needed evaluation infrastructure to diagnose the disconnect between citation appearance and factual reliability.

The paper introduces what the authors describe as the first framework for evaluating source attribution in LLM deep research agents, assessing citations along three dimensions: link validity, relevance, and factual accuracy. Benchmarking 14 LLMs shows that even the strongest frontier models achieve only 39-77% factual accuracy, and that Fact Check accuracy drops by about 42% on average across two frontier models as retrieval scales from 2 to 150 tool calls.

Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions: (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across these three dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess that disconnect.
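The extract-then-score pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it swaps the AST parser for a simple regular expression over inline Markdown links, and the names `Citation`, `Verdict`, `extract_citations`, and `summarize` are hypothetical. It also assumes each citation receives one boolean verdict per dimension and that every rate is computed over all extracted citations, which the abstract does not specify.

```python
import re
from dataclasses import dataclass

# Matches inline Markdown links such as [label](https://example.com/page).
# The lookbehind skips image syntax (![alt](url)). This regex is a stand-in
# for the paper's AST parser and will truncate URLs that contain ")".
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


@dataclass
class Citation:
    label: str
    url: str
    context: str  # the report line that carries the citation


@dataclass
class Verdict:
    # One boolean per dimension named in the abstract (assumed encoding).
    link_works: bool
    relevant_content: bool
    fact_check: bool


def extract_citations(markdown: str) -> list[Citation]:
    """Collect every inline link in the report together with its line."""
    citations = []
    for line in markdown.splitlines():
        for match in INLINE_LINK.finditer(line):
            citations.append(Citation(match.group(1), match.group(2), line.strip()))
    return citations


def summarize(verdicts: list[Verdict]) -> dict[str, float]:
    """Return the pass rate for each dimension over all judged citations."""
    total = len(verdicts)
    if total == 0:
        return {"link_works": 0.0, "relevant_content": 0.0, "fact_check": 0.0}
    return {
        "link_works": sum(v.link_works for v in verdicts) / total,
        "relevant_content": sum(v.relevant_content for v in verdicts) / total,
        "fact_check": sum(v.fact_check for v in verdicts) / total,
    }
```

In a fuller pipeline, each `Citation` would be passed to a fetcher and then to a human or LLM judge that fills in the `Verdict`; those steps are omitted here because the abstract gives no detail on how they are implemented.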
