Zero-Shot Warning Generation for Misinformative Multimodal Content
This work addresses the societal problem of deceptive online misinformation, though it appears incremental: it builds on existing detection methods and adds warning generation.
The paper tackles the detection of out-of-context misinformation, in which authentic images are paired with false text. It proposes a model that checks cross-modality consistency with minimal training time, along with a lightweight version that uses one-third of the parameters. It also introduces a zero-shot task for generating contextualized warnings to aid debunking.
The widespread prevalence of misinformation poses significant societal concerns. Out-of-context misinformation, where authentic images are paired with false text, is particularly deceptive and easily misleads audiences. Most existing detection methods primarily evaluate image-text consistency but often lack sufficient explanations, which are essential for effectively debunking misinformation. We present a model that detects multimodal misinformation through cross-modality consistency checks, requiring minimal training time. Additionally, we propose a lightweight model that achieves competitive performance using only one-third of the parameters. We also introduce a dual-purpose zero-shot learning task for generating contextualized warnings, enabling automated debunking and enhancing user comprehension. Qualitative and human evaluations of the generated warnings highlight both the potential and limitations of our approach.
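To make the idea of a cross-modality consistency check concrete, the following is a minimal sketch and not the model described above. It assumes that an image and its caption have already been mapped into a shared embedding space by some pretrained encoder (not shown), and it flags a pair as potentially out of context when the cosine similarity of the two embeddings falls below a threshold. The function names, the example vectors, and the threshold value of 0.5 are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_out_of_context(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> bool:
    """Flag an image-text pair whose embeddings disagree.

    Assumption: both embeddings come from the same joint image-text
    encoder, so a low similarity suggests the caption does not describe
    the image.
    """
    return cosine_similarity(image_emb, text_emb) < threshold


# Toy embeddings standing in for real encoder outputs.
image_embedding = [1.0, 0.0]
matching_caption = [1.0, 0.1]
mismatched_caption = [0.0, 1.0]

print(is_out_of_context(image_embedding, matching_caption))
print(is_out_of_context(image_embedding, mismatched_caption))
```

A fixed threshold on a single similarity score is the simplest possible decision rule. A trained detector such as the one summarized above would typically learn the decision from data, and the explanation behind a flagged pair is what the warning-generation task is meant to supply.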