CV | Jan 29

Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery

arXiv:2601.21159v1 (has code)

Originality: Incremental advance

AI Analysis

This work targets accurate geometric localization and semantic prediction in remote sensing imagery, which matters for applications such as land-cover analysis. It appears incremental because it builds on existing CLIP and vision foundation models, adding enhanced fusion and refinement techniques.

The paper tackles open-vocabulary semantic segmentation in high-resolution remote sensing imagery, where objects are densely distributed and boundaries are complex. It proposes a training-free framework called SDCI, which the authors report outperforms existing approaches on multiple benchmarks.

High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using "one-way injection" and "shallow post-processing" strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting self-attention maps into each other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: https://github.com/yu-ni1989/SDCI.
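
The abstract describes BCDR only at a high level, so the following is a minimal sketch of the general idea of refining two branches' segmentation scores by random-walk diffusion over each other's affinity graphs. It is not the authors' implementation: the function names, the cosine-similarity affinity, the restart weight `alpha`, the step count, and the final averaging are all assumptions made for illustration.

```python
import numpy as np


def row_normalize(affinity):
    """Turn a non-negative affinity matrix into a row-stochastic transition matrix."""
    sums = affinity.sum(axis=1, keepdims=True)
    return affinity / np.maximum(sums, 1e-12)


def feature_affinity(features, temperature=0.1):
    """Patch-to-patch affinity from cosine similarity (an assumed choice)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return np.exp((unit @ unit.T) / temperature)


def random_walk_diffusion(scores, transition, alpha=0.5, steps=10):
    """Random walk with restart: S <- alpha * P @ S + (1 - alpha) * S0."""
    current = scores.copy()
    for _ in range(steps):
        current = alpha * (transition @ current) + (1.0 - alpha) * scores
    return current


def bidirectional_refine(scores_a, scores_b, feats_a, feats_b, alpha=0.5, steps=10):
    """Diffuse each branch's scores over the OTHER branch's graph, then average.

    scores_*: (num_patches, num_classes) per-patch class scores.
    feats_*:  (num_patches, feature_dim) per-patch features of each branch.
    The cross-graph pairing and the averaging are hypothetical design choices.
    """
    p_a = row_normalize(feature_affinity(feats_a))
    p_b = row_normalize(feature_affinity(feats_b))
    refined_a = random_walk_diffusion(scores_a, p_b, alpha, steps)
    refined_b = random_walk_diffusion(scores_b, p_a, alpha, steps)
    return 0.5 * (refined_a + refined_b)
```

In this sketch, `scores_a` and `feats_a` would stand in for a CLIP-like branch and `scores_b` and `feats_b` for a VFM-like branch, all defined over the same patch grid. Because each transition matrix is row-stochastic, scores that start as per-patch probability distributions remain distributions after diffusion.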
