DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales
DiFaR addresses multimodal misinformation detection, offering an incremental improvement: it diversifies the rationales generated by vision-language models and filters them for factuality and relevance.
The paper targets three weaknesses of rationales generated by large vision-language models (LVLMs) that limit multimodal misinformation detection: insufficient diversity, factual inaccuracies, and irrelevant content. It introduces DiFaR, which outperforms baselines by up to 5.9% and boosts existing detectors by as much as 8.7% across four benchmarks.
Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
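The post-hoc filtering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how factuality and relevance are scored or how sentences are selected, so the scoring callables, the fixed thresholds, the naive sentence splitter, and every name below (`ScoredSentence`, `split_sentences`, `filter_rationales`) are assumptions made for illustration. The input is assumed to be a list of rationales, for example one per chain-of-thought prompt.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredSentence:
    """A rationale sentence with its (hypothetical) quality scores."""
    text: str
    factuality: float
    relevance: float


def split_sentences(rationale: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # A real system would likely use a proper sentence tokenizer.
    parts = re.split(r"(?<=[.!?])\s+", rationale.strip())
    return [p for p in parts if p]


def filter_rationales(
    rationales: List[str],
    factuality_fn: Callable[[str], float],
    relevance_fn: Callable[[str], float],
    fact_threshold: float = 0.5,  # assumed value, not from the paper
    rel_threshold: float = 0.5,   # assumed value, not from the paper
) -> List[ScoredSentence]:
    """Keep only sentences whose two scores both reach their thresholds."""
    kept: List[ScoredSentence] = []
    for rationale in rationales:
        for sentence in split_sentences(rationale):
            f = factuality_fn(sentence)
            r = relevance_fn(sentence)
            if f >= fact_threshold and r >= rel_threshold:
                kept.append(ScoredSentence(sentence, f, r))
    return kept
```

In use, `factuality_fn` and `relevance_fn` would be supplied by whatever scorers are available. The surviving sentences could then be joined and passed to a trainable detector as auxiliary text input, in keeping with the detector-agnostic design the abstract describes.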