CV · Apr 2, 2025

Image Difference Grounding with Natural Language

arXiv:2504.01952v1 · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the need for precise vision-language perception in applications such as automatic surveillance. It is an incremental advance, building on existing work in visual grounding and image difference understanding.

The paper tackles the problem of localizing fine-grained visual differences between image pairs from natural language instructions. It introduces the Image Difference Grounding (IDG) task, the DiffGround dataset, and a baseline model, DiffTracker, which is evaluated on DiffGround.

Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations. This limits their applicability in real-world scenarios like automatic surveillance, where detecting subtle but meaningful visual differences across multiple images is crucial. Besides, previous work on image difference understanding (IDU) has either focused on detecting all change regions without cross-modal text guidance, or on providing coarse-grained descriptions of differences. Therefore, to push towards finer-grained vision-language perception, we propose Image Difference Grounding (IDG), a task designed to precisely localize visual differences based on user instructions. We introduce DiffGround, a large-scale and high-quality dataset for IDG, containing image pairs with diverse visual variations along with instructions querying fine-grained differences. Besides, we present a baseline model for IDG, DiffTracker, which effectively integrates feature differential enhancement and common suppression to precisely locate differences. Experiments on the DiffGround dataset highlight the importance of our IDG dataset in enabling finer-grained IDU. To foster future research, both DiffGround data and DiffTracker model will be publicly released.
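
The abstract names "feature differential enhancement and common suppression" but does not say how DiffTracker implements them. The sketch below is only a toy illustration of the general idea, under assumptions of our own: each image is reduced to a list of per-location feature vectors, the "differential" term is the Euclidean distance between paired vectors, and the "common" term is their clamped cosine similarity. The function names `cosine` and `difference_scores` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def difference_scores(feats_a, feats_b):
    """Score each location by how much its features changed between two images.

    ASSUMPTION (illustrative only, not the paper's method):
      - differential term: Euclidean distance between paired feature vectors
      - common term: cosine similarity clamped to [0, 1], used to down-weight
        locations whose content is largely shared by both images
    Returns scores rescaled so the largest is 1.0 (all zeros if nothing changed).
    """
    scores = []
    for u, v in zip(feats_a, feats_b):
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        common = max(0.0, cosine(u, v))
        scores.append(diff * (1.0 - common))
    peak = max(scores) or 1.0  # avoid dividing by zero when nothing changed
    return [s / peak for s in scores]


# Three locations; only the last one changes between the two images.
image_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
scores = difference_scores(image_a, image_b)
changed_location = scores.index(max(scores))
```

In this toy setup the unchanged locations score zero and the changed one receives the top score. A real IDG model would additionally condition the localization on the text instruction, which this sketch leaves out.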

