AI | Feb 17, 2025

HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims

arXiv:2502.11753v2 | 5 citations | h-index: 11 | ACL
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge of scaling fact-checking efforts to multimodal misinformation. Its contribution is incremental, since it centers on dataset creation and benchmarking.

The authors tackle automated detection of checkworthy claims for fact-checking by introducing HintsOfTruth, a dataset of 27K real and synthetic image/claim pairs. They find that well-configured lightweight text-based encoders perform comparably to multimodal models but mainly pick out non-claim-like content, while multimodal models are more robust on synthetic data at a significant computational cost.

Misinformation can be countered with fact-checking, but the process is costly and slow. Identifying checkworthy claims is the first step, where automation can help scale fact-checkers' efforts. However, detection methods struggle with content that is (1) multimodal, (2) from diverse domains, and (3) synthetic. We introduce HintsOfTruth, a public dataset for multimodal checkworthiness detection with 27K real-world and synthetic image/claim pairs. The mix of real and synthetic data makes this dataset unique and ideal for benchmarking detection methods. We compare fine-tuned and prompted Large Language Models (LLMs). We find that well-configured lightweight text-based encoders perform comparably to multimodal models but the former only focus on identifying non-claim-like content. Multimodal LLMs can be more accurate but come at a significant computational cost, making them impractical for large-scale applications. When faced with synthetic data, multimodal models perform more robustly.

Code Implementations: 1 repo
