CL · AI · Jul 4, 2025

Recon, Answer, Verify: Agents in Search of Truth

arXiv:2507.03671v1 · 2 citations · h-index: 1 · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of scalable and realistic fact-checking for applications in politics and other domains. Its contribution is incremental, since it builds on existing agent-based methods.

The authors address unrealistic evaluation in automated fact-checking by building a new benchmark dataset, PFO, from which post-claim analysis and annotator cues have been removed; LLMs evaluated on it show an average performance drop of 22% compared with the unfiltered version. They propose RAV, an agentic framework that improves on state-of-the-art results by up to 25.28% on existing benchmarks and limits the performance drop on PFO to 16.3%.

Automated fact-checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact-checking is challenging because existing benchmark datasets often include post-claim analysis and annotator cues, which are absent in real-world scenarios where claims are fact-checked immediately after being made. This limits the realism of current evaluations. We present Politi-Fact-Only (PFO), a 5-class benchmark dataset of 2,982 political claims from politifact.com, from which all post-claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim's verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in macro-F1 compared to PFO's unfiltered version. Motivated by the challenges identified in existing LLM-based fact-checking systems, we propose RAV (Recon-Answer-Verify), an agentic framework with three agents: a question generator, an answer generator, and a label generator. Our pipeline iteratively generates and answers sub-questions to verify different aspects of the claim before finally generating the label. RAV generalizes across domains and label granularities, and it outperforms state-of-the-art approaches on the well-known benchmarks RAWFC (fact-checking, 3-class) by 25.28%, and HOVER (encyclopedia, 2-class) by 1.54%, 4.94%, and 1.78% on the 2-hop, 3-hop, and 4-hop sub-categories, respectively. Compared with the baselines, RAV shows the smallest performance drop, 16.3% in macro-F1, between PFO and its unfiltered version.
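The iterative loop described in the abstract (generate a sub-question, answer it, repeat, then assign a label) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `RAVPipeline`, the callable signatures, the `max_rounds` cap, the convention that the question generator returns `None` to stop, and the toy agents and label strings are all assumptions made for the example. In a real system each agent would presumably wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

QAPair = Tuple[str, str]  # (sub-question, answer)


@dataclass
class RAVPipeline:
    # Hypothetical agent interfaces; each is a plain callable in this sketch.
    ask: Callable[[str, List[QAPair]], Optional[str]]  # question generator, None means "stop"
    answer: Callable[[str, str], str]                   # answer generator
    label: Callable[[str, List[QAPair]], str]           # label generator
    max_rounds: int = 5                                 # assumed safety cap, not from the paper

    def run(self, claim: str) -> Tuple[str, List[QAPair]]:
        history: List[QAPair] = []
        for _ in range(self.max_rounds):
            # Ask for the next sub-question given everything answered so far.
            question = self.ask(claim, history)
            if question is None:
                break
            history.append((question, self.answer(claim, question)))
        # The final label is produced from the claim plus the full Q&A trace.
        return self.label(claim, history), history


# Toy stand-ins for the three agents (placeholders, no LLM involved).
def toy_ask(claim: str, history: List[QAPair]) -> Optional[str]:
    questions = ["Who made the claim?", "What evidence predates the claim?"]
    return questions[len(history)] if len(history) < len(questions) else None


def toy_answer(claim: str, question: str) -> str:
    return f"(stub answer to: {question})"


def toy_label(claim: str, history: List[QAPair]) -> str:
    # Placeholder label strings; the real label set depends on the dataset.
    return "half-true" if len(history) >= 2 else "unverified"


pipeline = RAVPipeline(toy_ask, toy_answer, toy_label)
verdict, trace = pipeline.run("Example claim")
```

Keeping the three agents as separate callables mirrors the separation of roles the abstract describes and makes it easy to swap in different models or prompts per role.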

