Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition
The paper addresses segmentation challenges in remote sensing for applications such as environmental monitoring, but its contribution is incremental because it builds on the existing CLIP and SAM models.
The paper tackles open-vocabulary semantic segmentation in remote sensing by proposing ReSeg-CLIP, a training-free method that combines hierarchical attention masking based on SAM masks with model composition by weighted parameter averaging, and it reports state-of-the-art results on three benchmarks.
In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing (RS) data. To compensate for the problems that vision-language models such as CLIP exhibit in semantic segmentation, which are caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme that uses masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, using a new weighting scheme that evaluates representational quality with varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
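The two ideas in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`segment_attention_mask`, `masked_softmax`, `compose_parameters`) are hypothetical, and the sketch makes several assumptions: each image patch carries a single segment id taken from a SAM mask, only one mask scale is shown (how scales are assigned across layers is not described in this section), and the per-model quality scores are treated as given inputs because the prompt-based weighting scheme is not detailed here.

```python
import math


def segment_attention_mask(segment_ids):
    """Return a boolean matrix where patch i may attend to patch j
    only if both patches carry the same segment id (assumed to come from SAM)."""
    return [[a == b for b in segment_ids] for a in segment_ids]


def masked_softmax(scores, allowed):
    """Row-wise softmax over attention scores that ignores disallowed positions.

    Every patch shares a segment with itself, so each row keeps at least one entry.
    """
    result = []
    for row, allow in zip(scores, allowed):
        kept = [s for s, ok in zip(row, allow) if ok]
        peak = max(kept)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allow)]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def compose_parameters(state_dicts, quality_scores):
    """Weighted average of parameter dictionaries with identical keys and shapes.

    quality_scores are hypothetical per-model scores; they are normalized
    so that the averaging weights sum to one.
    """
    total = sum(quality_scores)
    weights = [q / total for q in quality_scores]
    merged = {}
    for name in state_dicts[0]:
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged
```

For example, with segment ids `[0, 0, 1]` the third patch can attend only to itself, while the first two patches share attention between each other regardless of how large their raw score toward the third patch is. Calling `compose_parameters([{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}], [1.0, 3.0])` normalizes the scores to weights 0.25 and 0.75 before averaging. Flat lists stand in for tensors to keep the sketch dependency-free; a real setup would apply the same arithmetic to model state dictionaries.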