CV · Mar 18

MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya

arXiv:2603.17528 · 79.0 · h-index: 19

Predicted impact: top 29% in CV · last 90 days
Originality: Incremental advance

AI Analysis

The work addresses a limitation of existing open-vocabulary segmentation methods, which rely on clear-sky optical data, and so enables more resilient segmentation for remote sensing applications. The advance is incremental because it builds on established multimodal fusion techniques.

The paper tackles open-vocabulary segmentation in remote sensing under adverse weather conditions such as clouds and haze. Using a multimodal Optical-SAR fusion framework, it reports superior robustness and generalization across diverse cloud conditions.

Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
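To make the abstract's core idea concrete, here is a minimal toy sketch of open-vocabulary labeling on fused optical and SAR features. It is not the MM-OVSeg architecture: the per-pixel weighted average in `fuse` is a hypothetical stand-in for the paper's learned fusion module, the hand-written vectors stand in for encoder outputs, and the sketch assumes that pixel features and category text embeddings already live in one shared space. All function and variable names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(optical, sar, optical_weight):
    """Per-pixel convex combination of optical and SAR feature vectors.

    Assumption: a simple weighted average, used here only as a
    placeholder for a learned fusion module.
    """
    return [
        [
            [w * o + (1.0 - w) * s for o, s in zip(o_vec, s_vec)]
            for o_vec, s_vec, w in zip(o_row, s_row, w_row)
        ]
        for o_row, s_row, w_row in zip(optical, sar, optical_weight)
    ]


def segment(features, text_embeddings):
    """Label each pixel with the category whose text embedding is most similar."""
    names = list(text_embeddings)
    return [
        [max(names, key=lambda n: cosine(vec, text_embeddings[n])) for vec in row]
        for row in features
    ]


# Toy 1x2 image with 2-dimensional features (made-up values).
text = {"water": [1.0, 0.0], "building": [0.0, 1.0]}
optical = [[[0.9, 0.1], [0.5, 0.5]]]  # second pixel is cloud-covered: uninformative
sar = [[[0.8, 0.2], [0.1, 0.9]]]      # SAR still carries structure at that pixel
weights = [[0.8, 0.1]]                # trust optical less where cloud is present

labels = segment(fuse(optical, sar, weights), text)
print(labels)
```

Because the category set is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is what makes the segmentation open-vocabulary. In the toy data, the cloud-covered pixel is ambiguous from optical features alone and is resolved by down-weighting them in favor of SAR.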
