A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning
This work addresses practical deployment challenges for remote sensing image captioning, though its contribution is incremental: it builds on existing transformer and knowledge distillation techniques.
The paper tackles the high computational cost and weak fine-grained feature capture of transformer-based remote sensing image captioning by proposing a lightweight transformer with edge-aware fusion, which the authors report significantly improves caption quality compared to state-of-the-art methods.
Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models primarily focus on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed that reduces the dimensionality of the encoder layers and employs a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network. Furthermore, an edge-aware enhancement strategy is incorporated to strengthen image representation and object boundary understanding, enabling the model to capture fine-grained spatial details in remote sensing images. Experimental results demonstrate that the proposed approach significantly improves caption quality compared to state-of-the-art methods.
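The knowledge distillation step described above can be illustrated with a minimal sketch. The abstract does not state the exact loss, so this assumes the common soft-target formulation: a temperature-scaled KL divergence between teacher and student token distributions, blended with the usual cross-entropy on the reference token. The function names and the `temperature` and `alpha` values are hypothetical choices for illustration, not details taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target KL loss and hard-label cross-entropy.

    Assumed formulation (not confirmed by the paper):
        alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, target)
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # T^2 keeps the soft-term gradients on a scale comparable to the hard term.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard-label cross-entropy uses the unscaled student distribution.
    hard = -math.log(softmax(student_logits)[target_index] + 1e-12)
    return alpha * soft + (1 - alpha) * hard

# Toy example: logits over a 4-word vocabulary for one caption position.
teacher = [4.0, 1.0, 0.5, -1.0]
student = [2.5, 1.5, 0.0, -0.5]
loss = distillation_loss(student, teacher, target_index=0)
```

In a captioning model this quantity would be averaged over every token position in the caption. A higher `temperature` exposes more of the teacher's relative preferences among incorrect words, which is the extra signal a smaller decoder is meant to absorb.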