CV · Sep 26, 2025

LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation

Yixiao Liu, Yizhou Yang, Jinwen Li, Jun Tao, Ruoyu Li, Xiangkun Wang, Min Zhu, Junlong Cheng

arXiv:2509.21894v1 · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses a gap in remote sensing change detection: most methods ignore multimodal data such as text. Its language-guided approach enhances performance for applications like land cover monitoring, but the advance is incremental because it builds on an existing foundation model, SAM2.

The paper introduces a language-guided model for remote sensing change detection that uses text prompts to focus on regions of interest. The authors report improved accuracy and robustness, with the model outperforming state-of-the-art methods on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD).

Remote Sensing Change Detection (RSCD) typically identifies changes in land cover or surface conditions by analyzing multi-temporal images. Most current deep learning-based methods learn only unimodal visual information and neglect the rich semantic information provided by multimodal data such as text. To address this limitation, we propose a novel Language-Guided Change Detection model (LG-CD). This model leverages natural language prompts to direct the network's attention to regions of interest, significantly improving the accuracy and robustness of change detection. Specifically, LG-CD uses a visual foundation model (SAM2) as a feature extractor to capture multi-scale pyramid features, from high resolution to low resolution, from bi-temporal remote sensing images. Multi-layer adapters are then employed to fine-tune the model for the downstream task of remote sensing change detection. Additionally, we design a Text Fusion Attention Module (TFAM) to align visual and textual information, enabling the model to focus on target change regions using text prompts. Finally, a Vision-Semantic Fusion Decoder (V-SFD) deeply integrates visual and semantic information through a cross-attention mechanism to produce highly accurate change detection masks. Our experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods. Furthermore, our approach provides new insights into achieving generalized change detection by leveraging multimodal information.
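The abstract says visual and semantic information are integrated through a cross-attention mechanism but gives no internals. The sketch below illustrates only the general idea, scaled dot-product cross-attention in which visual tokens act as queries over text tokens. It is not the authors' TFAM or V-SFD. The function names, the omission of learned projections, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual_tokens, text_tokens):
    """Let each visual token (query) attend over all text tokens (keys/values).

    Hypothetical helper: learned query/key/value projections are left out
    (treated as identity) to keep the sketch short. Returns the fused
    tokens and the attention weights.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in visual_tokens:
        # Scaled dot product between this visual token and every text token.
        scores = [scale * sum(a * b for a, b in zip(query, key))
                  for key in text_tokens]
        attn = softmax(scores)
        # Weighted sum of text tokens gives the text context for this location.
        context = [sum(w * value[i] for w, value in zip(attn, text_tokens))
                   for i in range(dim)]
        # Residual connection (an assumption): keep the visual signal and add
        # the text context on top.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights


# Toy example: two visual tokens and two text tokens, each of dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_attention(visual, text)
```

In this toy case each visual token puts most of its attention on the text token it aligns with, which is the behavior a text prompt would rely on to emphasize matching image regions.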
