BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
This work addresses a domain-specific problem in remote sensing, with applications such as urban development monitoring and disaster assessment, and represents an incremental improvement over previous methods.
The paper targets the inadequate modeling of temporal correlations and spatial semantic changes in bi-temporal satellite imagery analysis. It proposes BTCChat, a multi-temporal multimodal large language model with a Change Extraction module and a Prompt Augmentation mechanism, and reports state-of-the-art performance on change captioning and visual question answering tasks.
Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied to bi-temporal change analysis, previous methods process image pairs through direct concatenation, which inadequately models temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding and thereby constrains the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to sharpen the model's attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to improve performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
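The abstract does not specify the internals of the Change Extraction module or the Prompt Augmentation mechanism, so the following is only a toy sketch of the ideas it contrasts, not BTCChat's implementation. It assumes each image is already encoded as a list of token feature vectors. The function names (`concat_tokens`, `change_features`, `augment_prompt`), the absolute-difference change signal, and the "Context clues" prompt format are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Tokens = List[List[float]]  # one feature vector per visual token (assumed layout)


def concat_tokens(feats_t1: Tokens, feats_t2: Tokens) -> Tokens:
    """Baseline described in the abstract: pass both token sequences
    to the language model by direct concatenation."""
    return list(feats_t1) + list(feats_t2)


def change_features(feats_t1: Tokens, feats_t2: Tokens) -> Tokens:
    """Hypothetical explicit change signal: per-token absolute difference.
    This stands in for a learned change-extraction step; it is an assumption,
    not the module proposed in the paper."""
    if len(feats_t1) != len(feats_t2):
        raise ValueError("both images must yield the same number of tokens")
    return [
        [abs(a - b) for a, b in zip(tok1, tok2)]
        for tok1, tok2 in zip(feats_t1, feats_t2)
    ]


def augment_prompt(question: str, clues: Sequence[str]) -> str:
    """Hypothetical prompt augmentation: append contextual clues to the
    user prompt so the model is pointed at relevant spatial details."""
    if not clues:
        return question
    return f"{question}\nContext clues: {'; '.join(clues)}"


if __name__ == "__main__":
    before = [[1.0, 2.0], [0.5, 0.5]]
    after = [[1.0, 5.0], [0.5, 0.5]]
    print(len(concat_tokens(before, after)))
    print(change_features(before, after))
    print(augment_prompt("Describe the changes.", ["focus on buildings"]))
```

In this sketch the difference features make unchanged tokens collapse to zero vectors, which is one simple way to expose change explicitly instead of leaving the language model to infer it from two concatenated sequences.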