CV · MM · Dec 17, 2024

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

arXiv:2412.12718v2 · 16 citations · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the challenge of accurately identifying manipulated content in images and text for applications in media verification and security, representing an incremental improvement over prior methods.

The paper tackles the problem of detecting and grounding multi-modal media manipulation by advancing cross-modal semantic alignment, resulting in a model that surpasses existing methods by a clear margin on the DGM4 dataset.

We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between the image and text is vital for accurate manipulation detection and grounding. However, existing DGM4 methods pay little attention to cross-modal alignment, which limits further gains in manipulation detection accuracy. To remedy this issue, this work aims to advance semantic alignment learning to promote this task. In particular, we utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct image-text pairs, especially for the manipulated instances. Subsequently, cross-modal alignment learning is performed to enhance the semantic alignment. Besides these explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) to provide implicit guidance for augmenting manipulation perception. With the ground truth available during training, MGCA encourages the model to concentrate more on manipulated components while downplaying normal ones, enhancing the model's ability to capture manipulations. Extensive experiments are conducted on the DGM4 dataset, and the results demonstrate that our model surpasses the compared methods by a clear margin.
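
The abstract does not spell out how MGCA steers attention, so the following is only an illustrative sketch of the general idea: given cross-attention scores over a set of positions (for example image patches or text tokens) and a binary mask marking the manipulated ones, penalize the model when little attention mass falls on the manipulated positions. The function name `attention_guidance_loss`, the negative-log-mass form of the loss, and the toy inputs are all assumptions made for illustration; they are not taken from the paper or its code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_guidance_loss(scores, manipulated_mask, eps=1e-8):
    """Hypothetical guidance term (an assumption, not the paper's loss).

    Returns the negative log of the attention mass that lands on
    positions flagged as manipulated. The loss is small when attention
    concentrates on manipulated positions and large when it does not.
    """
    if len(scores) != len(manipulated_mask):
        raise ValueError("scores and mask must have the same length")
    if not any(manipulated_mask):
        # Pristine sample: nothing to guide towards in this sketch.
        return 0.0
    weights = softmax(scores)
    mass = sum(w for w, m in zip(weights, manipulated_mask) if m)
    return -math.log(mass + eps)


# Toy example: positions 1 and 2 are marked as manipulated.
mask = [0, 1, 1, 0]
focused = attention_guidance_loss([0.0, 3.0, 3.0, 0.0], mask)
diffuse = attention_guidance_loss([0.0, 0.0, 0.0, 0.0], mask)
print(focused < diffuse)
```

In this toy setup, attention that concentrates on the flagged positions yields a lower loss than uniform attention, which matches the stated goal of focusing on manipulated components while downplaying normal ones. A real implementation would operate on batched tensors inside the attention layers and would likely be combined with the detection and grounding objectives.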

Code Implementations: 1 repo