CV | Mar 10, 2025

RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation

arXiv:2503.07266v3 | 3 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses the problem of segmenting objects in remote sensing images based on text descriptions. Its contribution is an incremental improvement aimed at domain-specific applications.

The paper tackles the challenge of adapting Segment Anything Model 2 (SAM2) for Referring Remote Sensing Image Segmentation (RRSIS) by aligning visual and textual features and generating pseudo-mask prompts, achieving state-of-the-art performance on multiple benchmarks.

Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text descriptions. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features, providing pseudo-mask-based dense prompts, and enforcing boundary constraints. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
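
To make the idea of a pseudo-mask dense prompt concrete, here is a minimal sketch. The abstract does not say how the mask prompt generator combines the visual embeddings and class tokens, so this example assumes a simple stand-in: score each spatial location by the cosine similarity between its feature vector and a single class token, then squash the score with a sigmoid. The function name `pseudo_mask`, the `temperature` parameter, and the nested-list data layout are all hypothetical and are not taken from the paper.

```python
import math


def pseudo_mask(visual, class_token, temperature=0.1):
    """Illustrative stand-in for a mask prompt generator (an assumption,
    not the RS2-SAM2 design).

    visual:      H x W grid of C-dimensional feature vectors (nested lists).
    class_token: C-dimensional list summarizing the referred target.
    Returns an H x W grid of soft mask values in (0, 1).
    """

    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid dividing by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    token_norm = norm(class_token)
    mask = []
    for row in visual:
        out_row = []
        for feat in row:
            dot = sum(a * b for a, b in zip(feat, class_token))
            cos = dot / (norm(feat) * token_norm)
            # Sigmoid of the temperature-scaled similarity gives a soft mask value.
            out_row.append(1.0 / (1.0 + math.exp(-cos / temperature)))
        mask.append(out_row)
    return mask


if __name__ == "__main__":
    # Toy 2 x 2 feature map with 2-dimensional features.
    features = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[-1.0, 0.0], [1.0, 1.0]],
    ]
    token = [1.0, 0.0]
    for row in pseudo_mask(features, token):
        print([round(v, 3) for v in row])
```

In a real pipeline the resulting soft mask would be resized to whatever resolution the segmentation model expects for its dense (mask) prompt. A learned generator, as the abstract describes, would replace the fixed similarity rule used here.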
