CL | AI | Jun 14, 2025

RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking

arXiv:2506.12538v1 | 9 citations | h-index: 9 | Has Code | MM
Originality: Incremental advance
AI Analysis

This paper addresses the inadequate evaluation of LLMs in real-world misinformation scenarios. The advance is incremental, since it builds on existing benchmarking efforts.

The authors address the lack of realistic benchmarks for evaluating large language models in fact-checking by introducing RealFactBench, a benchmark of 6K claims with a new Unknown Rate (UnR) metric. Experiments on 7 LLMs and 4 MLLMs reveal limitations in real-world fact-checking.

Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at https://github.com/kalendsyang/RealFactBench.git.
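
The abstract names the Unknown Rate (UnR) metric but does not give its formula. As a minimal sketch only, assuming UnR is the share of claims on which a model answers "Unknown" instead of committing to a verdict, it could be computed as below. The function name `unknown_rate`, the label string "unknown", and the sample predictions are hypothetical illustrations, not taken from the paper or its repository.

```python
def unknown_rate(predictions, unknown_label="unknown"):
    """Return the fraction of predictions that equal the abstention label.

    ASSUMPTION: this definition of UnR is inferred from the metric's name;
    the paper's exact formula may differ.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    # Normalize case and whitespace so "Unknown " and "unknown" match.
    normalized = [p.strip().lower() for p in predictions]
    abstentions = sum(1 for p in normalized if p == unknown_label)
    return abstentions / len(normalized)


# Hypothetical model outputs for four claims.
sample_predictions = ["true", "Unknown", "false", "unknown"]
sample_unr = unknown_rate(sample_predictions)
```

Under this assumed definition, a UnR near 1 would suggest an over-conservative model that rarely commits to a verdict, while a UnR near 0 alongside low accuracy would suggest over-confidence.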

Code Implementations: 1 repo
