CV | Nov 18, 2024

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

Zhiming Wang, Mingze Wang, Sheng Xu, Yanjing Li, Baochang Zhang

arXiv:2411.11360v111.319 citations | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses remote sensing change captioning for applications such as environmental monitoring. It is rated incremental because it builds on existing multimodal large language models and adds task-specific enhancements.

The paper tackles the problem of generating natural language descriptions of changes between remote sensing images. It proposes CCExpert, a model that integrates difference-aware features and is supported by a large dataset, and reports a state-of-the-art score of $S^*_m=81.80$ on the LEVIR-CC benchmark.

Remote Sensing Image Change Captioning (RSICC) aims to generate natural language descriptions of surface changes between multi-temporal remote sensing images, detailing the categories, locations, and dynamics of changed objects (e.g., additions or disappearances). Many current methods attempt to leverage the long-sequence understanding and reasoning capabilities of multimodal large language models (MLLMs) for this task. However, without comprehensive data support, these approaches often alter the essential feature transmission pathways of MLLMs, disrupting the intrinsic knowledge within the models and limiting their potential in RSICC. In this paper, we propose a novel model, CCExpert, based on a new, advanced multimodal large model framework. First, we design a difference-aware integration module that captures multi-scale differences between bi-temporal images and incorporates them into the original image context, thereby enhancing the signal-to-noise ratio of the differential features. Second, we construct a high-quality, diversified dataset called CC-Foundation, containing 200,000 image pairs and 1.2 million captions, to provide substantial data support for continued pretraining in this domain. Finally, we employ a three-stage progressive training process to ensure the deep integration of the difference-aware integration module with the pretrained MLLM. CCExpert achieves $S^*_m=81.80$ on the LEVIR-CC benchmark, significantly surpassing previous state-of-the-art methods. The code and part of the dataset will soon be open-sourced at https://github.com/Meize0729/CCExpert.
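The idea of capturing multi-scale differences and folding them back into the original image context can be illustrated with a toy sketch. This is not the CCExpert module: the abstract does not specify its architecture, so the average pooling, absolute difference, nearest-neighbour upsampling, and residual addition used here, along with every function name and the `weight` parameter, are assumptions chosen only to make the concept concrete. Feature maps are plain 2D lists whose sides must be divisible by each scale.

```python
from typing import List, Sequence

Grid = List[List[float]]


def avg_pool(grid: Grid, k: int) -> Grid:
    """Average-pool a 2D grid with a k x k window and stride k."""
    h, w = len(grid), len(grid[0])
    return [
        [
            sum(grid[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, w - k + 1, k)
        ]
        for i in range(0, h - k + 1, k)
    ]


def upsample(grid: Grid, k: int) -> Grid:
    """Nearest-neighbour upsampling by an integer factor k."""
    return [[v for v in row for _ in range(k)] for row in grid for _ in range(k)]


def multi_scale_difference(before: Grid, after: Grid, scales: Sequence[int] = (1, 2)) -> Grid:
    """Average the absolute bi-temporal difference over several scales (hypothetical helper)."""
    h, w = len(before), len(before[0])
    total = [[0.0] * w for _ in range(h)]
    for k in scales:
        pooled_before, pooled_after = avg_pool(before, k), avg_pool(after, k)
        diff = [
            [abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(pooled_after, pooled_before)
        ]
        restored = upsample(diff, k)  # back to the original resolution
        for i in range(h):
            for j in range(w):
                total[i][j] += restored[i][j] / len(scales)
    return total


def integrate(features: Grid, diff: Grid, weight: float = 1.0) -> Grid:
    """Add the difference map to the original features as a residual (assumed design)."""
    return [[f + weight * d for f, d in zip(rf, rd)] for rf, rd in zip(features, diff)]


# Toy bi-temporal pair: a single "new object" appears at the top-left corner.
before = [[0.0] * 4 for _ in range(4)]
after = [[0.0] * 4 for _ in range(4)]
after[0][0] = 4.0

diff_map = multi_scale_difference(before, after)
enhanced_after = integrate(after, diff_map)
```

In this toy setup the changed location receives the strongest response, its coarse-scale neighbourhood a weaker one, and unchanged regions stay at zero, which is the intuition behind emphasizing differential features while keeping the original image context intact.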
