AI · Mar 6

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama

arXiv:2603.05912v1 · 19.42 citations · h-index: 39

Predicted impact: top 14% in AI · last 90 days

Originality: Highly original

AI Analysis

This work is significant for researchers and developers working with LLM agents to produce deep research reports, as it provides a more reliable method for fact-checking and a benchmark for evaluating such systems, addressing the brittleness of static expert-labeled benchmarks.

This paper addresses the challenge of verifying claim-level factuality in Deep Research Reports (DRRs) generated by search-augmented LLM agents. The authors find that static expert-labeled benchmarks are unreliable in this setting: unassisted PhD-level experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. They propose Evolving Benchmarking via Audit-then-Score (AtS), in which benchmark labels and rationales are revisable; across four AtS rounds, expert micro-gold accuracy rises to 90.9%. They instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark, and DeepFact-Eval, a document-level verification agent that outperforms existing verifiers on DeepFact-Bench.

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
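The Audit-then-Score loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify data structures or interfaces, so every name here (`Entry`, `Verdict`, `Benchmark`, `audit_then_score`, and the toy `auditor`) is a hypothetical stand-in, and the auditor function is only a placeholder for the human expert who adjudicates disputes. The sketch shows one round: disagreements backed by evidence are audited, accepted revisions update the benchmark, and only then is the verifier scored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Entry:
    """Current benchmark state for one claim (hypothetical structure)."""
    label: bool      # True if the claim is currently labeled as supported
    rationale: str   # auditable rationale behind the label


@dataclass
class Verdict:
    """A verifier's output for one claim (hypothetical structure)."""
    label: bool
    evidence: str    # must be non-empty to dispute the benchmark label


@dataclass
class Benchmark:
    version: int
    entries: Dict[str, Entry]


# The auditor returns True to accept a proposed revision.
# In the paper this role is played by an expert; here it is any callable.
Auditor = Callable[[str, Entry, Verdict], bool]


def audit_then_score(
    bench: Benchmark, verdicts: Dict[str, Verdict], auditor: Auditor
) -> Tuple[Benchmark, float]:
    """Run one audit round, then score the verifier on the updated labels."""
    entries = dict(bench.entries)
    revised = False
    for claim_id, verdict in verdicts.items():
        current = entries[claim_id]
        if verdict.label == current.label:
            continue  # no dispute
        if not verdict.evidence:
            continue  # a disagreement without evidence is not audited
        if auditor(claim_id, current, verdict):
            # Accepted revision: update label and rationale before scoring.
            entries[claim_id] = Entry(verdict.label, verdict.evidence)
            revised = True

    # Bump the version only if at least one label changed (an assumption).
    updated = Benchmark(bench.version + (1 if revised else 0), entries)
    correct = sum(v.label == updated.entries[c].label for c, v in verdicts.items())
    accuracy = correct / len(verdicts) if verdicts else 0.0
    return updated, accuracy


# Toy example: the auditor accepts only evidence that cites a source.
def toy_auditor(claim_id: str, current: Entry, verdict: Verdict) -> bool:
    return "source:" in verdict.evidence


bench = Benchmark(
    version=1,
    entries={
        "c1": Entry(True, "initial expert label"),
        "c2": Entry(True, "initial expert label"),
        "c3": Entry(False, "initial expert label"),
    },
)
verdicts = {
    "c1": Verdict(True, ""),                               # agrees
    "c2": Verdict(False, "source: contradicting report"),  # dispute, accepted
    "c3": Verdict(True, "seems plausible"),                # dispute, rejected
}
new_bench, accuracy = audit_then_score(bench, verdicts, toy_auditor)
```

In this toy run the verifier is scored against the revised labels, so its accepted dispute on `c2` counts as correct while its rejected dispute on `c3` does not. Whether the real benchmark bumps its version per round or per accepted revision is not stated in the abstract; the versioning rule above is an assumption.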
