RISAM: Referring Image Segmentation via Mutual-Aware Attention Features
This work addresses the challenge of accurately segmenting image regions from language prompts, with applications in computer vision and human-computer interaction. It represents an incremental improvement over prior methods.
The paper tackles referring image segmentation, where existing methods often segment a visually salient entity instead of the referred region because visual context dominates the multi-modal features. It proposes MARIS, which uses a mutual-aware attention mechanism to enhance cross-modal fusion, and reports state-of-the-art performance on three benchmark datasets.
Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referring region, as the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance the cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. Correspondingly, we design a Mask Decoder to enable explicit linguistic guidance for more consistent segmentation with the language expression. To this end, a multi-modal query token is proposed to integrate linguistic information and interact with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms the state-of-the-art RIS methods. Our code will be publicly available.
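The bidirectional fusion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head scaled dot-product attention, the absence of learned projections, and the residual fusion are all assumptions made for illustration. The abstract does not specify which of the two directions corresponds to Vision-Guided Attention and which to Language-Guided Attention, so the branches are named here by what they compute.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over context rows.

    Returns the attended features (n_q, d) and the weights (n_q, n_ctx).
    """
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale, axis=-1)
    return weights @ context, weights

def mutual_aware_attention(visual, linguistic):
    """Hypothetical two-branch fusion; both branches read the same inputs.

    visual:     (n_regions, d) image-region features
    linguistic: (n_words, d)   word features
    """
    # Branch 1: every image region weighs the words of the expression.
    words_per_region, _ = cross_attention(visual, linguistic)
    # Branch 2: every word weighs the image regions.
    regions_per_word, _ = cross_attention(linguistic, visual)
    # Residual fusion is an assumption, not taken from the paper.
    language_aware_visual = visual + words_per_region
    vision_aware_linguistic = linguistic + regions_per_word
    return language_aware_visual, vision_aware_linguistic

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.standard_normal((6, 8))      # 6 regions, 8-dim features
    linguistic = rng.standard_normal((3, 8))  # 3 words, 8-dim features
    fused_v, fused_l = mutual_aware_attention(visual, linguistic)
    print(fused_v.shape, fused_l.shape)       # (6, 8) (3, 8)
```

Because the two branches run in parallel on the same inputs, neither modality's update depends on the other's output. This is one plausible reading of "two parallel branches"; a real model would add learned query, key, and value projections and multiple heads.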