AI · Feb 17, 2025

HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims

Michiel van der Meer, Pavel Korshunov, Sébastien Marcel, Lonneke van der Plas

arXiv:2502.11753v2 · 11.15 citations · h-index: 60 · ACL

Originality: Synthesis-oriented

AI Analysis

This paper addresses the challenge of scaling fact-checking efforts for multimodal misinformation, though its contribution is incremental, centering on dataset creation and benchmarking.

The authors tackle the problem of automating checkworthy claim detection for fact-checking by introducing HintsOfTruth, a dataset of 27K real-world and synthetic image/claim pairs. They find that well-configured lightweight text-based encoders perform comparably to multimodal models but only focus on identifying non-claim-like content, while multimodal LLMs can be more accurate and are more robust on synthetic data, at a significant computational cost.

Misinformation can be countered with fact-checking, but the process is costly and slow. Identifying checkworthy claims is the first step, where automation can help scale fact-checkers' efforts. However, detection methods struggle with content that is (1) multimodal, (2) from diverse domains, and (3) synthetic. We introduce HintsOfTruth, a public dataset for multimodal checkworthiness detection with 27K real-world and synthetic image/claim pairs. The mix of real and synthetic data makes this dataset unique and ideal for benchmarking detection methods. We compare fine-tuned and prompted Large Language Models (LLMs). We find that well-configured lightweight text-based encoders perform comparably to multimodal models but the former only focus on identifying non-claim-like content. Multimodal LLMs can be more accurate but come at a significant computational cost, making them impractical for large-scale applications. When faced with synthetic data, multimodal models perform more robustly.
