Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach
This work addresses computational and semantic limitations in remote sensing change captioning, offering an incremental improvement for monitoring Earth's dynamics.
The paper addresses two problems in change captioning for remote sensing data, high computational cost and insufficient descriptive detail, by proposing SAT-Cap, a single-stage transformer approach. SAT-Cap achieves CIDEr scores of 140.23% on LEVIR-CC and 97.74% on DUBAI-CC, surpassing state-of-the-art methods.
Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Whereas typical models require multi-stage fusion in both the transformer encoder and the fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be made available online.
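To make the idea of a cosine similarity-based fusion concrete, the following minimal sketch shows one way a single-pass fusion of bi-temporal features could look. The abstract does not give the fusion formula, so everything here is an assumption for illustration: the function names `cosine_similarity` and `difference_guided_fusion`, the weighting by (1 - similarity), and the use of plain Python lists in place of tensors are all hypothetical and may differ from the actual Difference-Guided Fusion module.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def difference_guided_fusion(feats_before, feats_after):
    """Fuse per-token features of two images in a single pass.

    Assumption (not from the paper): each token's difference
    (after - before) is scaled by (1 - cosine similarity), so tokens
    that look the same in both images are suppressed.
    """
    fused = []
    for f1, f2 in zip(feats_before, feats_after):
        weight = 1.0 - cosine_similarity(f1, f2)  # 0 when directions match
        fused.append([weight * (b - a) for a, b in zip(f1, f2)])
    return fused


# Toy example: token 0 is unchanged, token 1 changes direction.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [1.0, 0.0]]
fused = difference_guided_fusion(before, after)
print(fused)
```

In this toy example the unchanged token is zeroed out while the changed token keeps its full difference vector, which is the kind of change-focused signal a caption decoder could attend to. Under this assumption the fusion needs only one pass over the token pairs and has no learned parameters.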