CV | MM | Apr 8, 2025

Latent Multimodal Reconstruction for Misinformation Detection

arXiv:2504.06010v2 | 1 citation | h-index: 16 | Has Code
Originality: Highly original
AI Analysis

This work addresses the limited and simplistic training data available for multimodal misinformation detection, a task intended to support fact-checkers, and reports improved generalization to real-world cases.

The paper tackles the detection of multimodal misinformation, such as miscaptioned images, with two contributions: a new dataset generated with Large Vision-Language Models, and a reconstruction-based network. It reports state-of-the-art results on the NewsCLIPpings and VERITE benchmarks.

Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have focused on developing datasets and methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic examples that lack real-world complexity, limiting model robustness. Meanwhile, Large Vision-Language Models (LVLMs) remain underexplored for generating diverse and realistic synthetic data for MMD. To address this gap, we introduce "MisCaption This!", a collection of LVLM-generated miscaptioned image datasets. Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" generalize better to real-world misinformation, while LAMAR achieves new state-of-the-art results on both the NewsCLIPpings and VERITE benchmarks, highlighting the value of LVLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
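To make the two baseline synthetic-data strategies named in the abstract more concrete, the following is a minimal sketch of out-of-context pairing and named entity manipulation. It is not taken from the paper or its repository: the `Sample` class, the function names, and the example captions are all hypothetical, and a real pipeline would find entities with a named entity recognizer rather than take them as arguments.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: an image identifier, its caption, and a label."""
    image_id: str
    caption: str
    label: str  # "truthful" or "miscaptioned"


def out_of_context_pairs(samples, seed=0):
    """Pair every image with the caption of a different sample.

    Assumes captions are distinct across samples; duplicated captions
    could produce a "falsified" pair identical to the original.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to swap captions")
    rng = random.Random(seed)
    falsified = []
    for i, sample in enumerate(samples):
        # Draw an index uniformly from all positions except i.
        j = rng.randrange(len(samples) - 1)
        if j >= i:
            j += 1
        falsified.append(
            Sample(sample.image_id, samples[j].caption, "miscaptioned")
        )
    return falsified


def swap_entity(sample, entity, replacement):
    """Replace a known entity string (name, date, location) in a caption.

    Returns None when the entity does not occur in the caption, so the
    caller never receives an unchanged caption labeled as miscaptioned.
    """
    if entity not in sample.caption:
        return None
    new_caption = sample.caption.replace(entity, replacement)
    return Sample(sample.image_id, new_caption, "miscaptioned")


# Illustrative, made-up data.
truthful = [
    Sample("img_001", "Flooded streets in Paris in June 2016", "truthful"),
    Sample("img_002", "Protesters gather outside city hall", "truthful"),
    Sample("img_003", "A wildfire burns near a coastal town", "truthful"),
]
ooc_examples = out_of_context_pairs(truthful, seed=42)
entity_example = swap_entity(truthful[0], "Paris", "Berlin")
```

Both functions only recombine or edit existing text, which illustrates why the abstract describes such examples as simplistic compared with LVLM-generated miscaptions.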

Code Implementations: 1 repo