CL, IR, MM | Jan 20, 2025

Verifying Cross-modal Entity Consistency in News using Vision-language Models

arXiv:2501.11403v2 | 4 citations | h-index: 10 | Has Code | ECIR
AI Analysis

This work addresses the challenge of cross-modal disinformation detection for news consumers and platforms, though it is incremental in that it builds on existing datasets and methods.

The paper tackles disinformation detection in news by verifying whether entities such as persons, locations, and events are consistent across images and text. It uses a framework based on large vision-language models that improves accuracy for persons and events when evidence images are used, and that outperforms a baseline for location and event verification in documents.

The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities such as images and text. Identifying inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical for detecting disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images with the whole document, neglecting relations between individual entities, or focus on generic entities that are not relevant to news. So far, only a few approaches have addressed the task of validating entity consistency between images and text in news, and the potential of large vision-language models (LVLMs) for this task has not yet been explored. In this paper, we propose an LVLM-based framework for verifying Cross-modal Entity Consistency (LVLM4CEC), which assesses whether persons, locations, and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from the web. Moreover, we extend three existing datasets for the task of entity verification in news by providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, with improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at https://github.com/TIBHannover/LVLM4CEC.

Code Implementations: 1 repo
