CV · Apr 13

Scene Change Detection with Vision-Language Representation Learning

arXiv:2604.11402 · 15.3 · h-index: 8
Predicted impact: top 40% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For urban monitoring and navigation, this work addresses the limitation of single-modal visual features in scene change detection by incorporating semantic reasoning through language, enabling more robust detection under challenging conditions.

LangSCD integrates vision-language models to generate textual descriptions of scene changes, fusing them with visual features via a cross-modal enhancer and a geometric-semantic matching module, achieving state-of-the-art performance on multiple street-view benchmarks. The method also introduces NYC-CD, a large-scale dataset with multiclass change annotations.

Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.
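
The abstract does not specify how the cross-modal feature enhancer is built. As a rough illustration of one common way to fuse text features into visual features, the sketch below uses plain scaled dot-product cross-attention with a residual connection. This is an assumption for illustration, not LangSCD's actual design: the function name `cross_attention_fuse`, the toy feature vectors, and the absence of learned projections are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual, text):
    """Enhance each visual feature with an attention-weighted mix of text features.

    visual: list of N vectors of length d (e.g. one per image patch).
    text:   list of M vectors of length d (e.g. one per description token).
    Returns (fused, weights), where fused has the same shape as visual.

    Assumption: no learned query/key/value projections are applied here;
    a trained model would normally learn them.
    """
    d = len(visual[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for v in visual:
        # Similarity of this visual feature to every text feature.
        scores = [scale * sum(a * b for a, b in zip(v, t)) for t in text]
        w = softmax(scores)
        # Text context vector: attention-weighted average of text features.
        context = [sum(w[j] * text[j][k] for j in range(len(text))) for k in range(d)]
        # Residual connection keeps the original visual signal.
        fused.append([v[k] + context[k] for k in range(d)])
        weights.append(w)
    return fused, weights


# Toy inputs (hypothetical values): three visual features, two text features.
visual_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[2.0, 0.0], [0.0, 2.0]]
fused_feats, attn = cross_attention_fuse(visual_feats, text_feats)
```

In this toy setup, each visual feature attends most strongly to the text feature it aligns with, and the residual addition leaves the visual feature map's shape unchanged, so the fused output could in principle be passed to an existing change-detection head.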

