CV | Mar 14, 2025

Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation

arXiv:2503.11183v1 | 14 citations | h-index: 3 | Has Code | IEEE Geoscience and Remote Sensing Letters

Originality: Incremental advance

AI Analysis

This work addresses the segmentation of objects in remote sensing images from text descriptions, a task with practical value in remote sensing. It appears incremental, however, because it builds on existing multimodal fusion approaches.

The paper tackles referring remote sensing image segmentation by proposing a multimodal-aware fusion network (MAFN) that improves alignment between the visual and text modalities. The authors report state-of-the-art results on the RRSIS-D dataset.

Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation that aims to segment objects based on a given text description, and it has great significance in practical applications. Previous studies fuse the visual and linguistic modalities by explicit feature interaction, which fails to effectively extract useful multimodal information from the dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) that enhances multi-scale visual features by adaptively introducing noise in the transformer and integrates cross-modal aware features. In addition, MAFN employs multi-scale refinement convolution (MSRC) to adapt to the various orientations of objects at different scales, boosting their representation ability and enhancing segmentation accuracy. Extensive experiments show that MAFN is significantly more effective than state-of-the-art methods on the RRSIS-D dataset. The source code is available at https://github.com/Roaxy/MAFN.
