Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention
This work addresses multimodal image fusion, with applications such as surveillance and medical imaging, but it appears incremental because it builds on existing fusion models that already use hybrid architectures.
The authors address the fusion of images from heterogeneous sensors, which enriches information and improves imaging quality, by proposing a hybrid CNN-Transformer model with non-local cross-modal attention. Qualitative and quantitative experiments compare the model with existing state-of-the-art fusion models and demonstrate its effectiveness.
The fusion of images taken by heterogeneous sensors helps to enrich the information and improve the quality of imaging. In this article, we present a hybrid model consisting of a convolutional encoder and a Transformer-based decoder to fuse multimodal images. In the encoder, a non-local cross-modal attention block is proposed to capture both local and global dependencies of multiple source images. A branch fusion module is designed to adaptively fuse the features of the two branches. We embed a Transformer module with linear complexity in the decoder to enhance the reconstruction capability of the proposed network. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method by comparing it with existing state-of-the-art fusion models. The source code of our work is available at https://github.com/pandayuanyu/HCFusion.
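To make the idea of non-local cross-modal attention concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes two feature maps of equal shape, one per modality, and lets every spatial position of one modality attend to all positions of the other. The function name `cross_modal_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(feat_a, feat_b, w_q, w_k, w_v):
    """Hypothetical non-local cross-modal attention block.

    feat_a, feat_b: feature maps of shape (C, H, W), one per modality.
    w_q, w_k: projections of shape (C, D); w_v: projection of shape (C, C).
    Queries come from modality A; keys and values come from modality B.
    Returns a feature map of shape (C, H, W).
    """
    c, h, w = feat_a.shape
    a = feat_a.reshape(c, h * w).T  # (N, C), N = H * W positions
    b = feat_b.reshape(c, h * w).T  # (N, C)
    q = a @ w_q                     # (N, D)
    k = b @ w_k                     # (N, D)
    v = b @ w_v                     # (N, C)
    # Each position in A is compared with every position in B (non-local).
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, N)
    out = a + attn @ v              # residual connection (an assumption)
    return out.T.reshape(c, h, w)


# Toy inputs standing in for encoder features of two modalities.
rng = np.random.default_rng(0)
c, h, w, d = 8, 4, 4, 4
ir = rng.standard_normal((c, h, w))
vis = rng.standard_normal((c, h, w))
w_q = rng.standard_normal((c, d)) * 0.1
w_k = rng.standard_normal((c, d)) * 0.1
w_v = rng.standard_normal((c, c)) * 0.1

fused = cross_modal_attention(ir, vis, w_q, w_k, w_v)
print(fused.shape)
```

Note that the attention matrix in this sketch has one entry per pair of positions, so its cost grows quadratically with the number of positions; the Transformer module with linear complexity mentioned in the abstract is a different design that this sketch does not reproduce.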