CV · Mar 18

MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

arXiv:2603.17528 · 79.0 · h-index: 19
Predicted impact: top 29% in CV · last 90 days
Originality: Incremental advance
AI Analysis

The paper addresses a limitation of existing open-vocabulary segmentation methods, which rely on clear-sky optical data, and so enables more resilient segmentation for remote sensing applications. The advance is incremental because it builds on existing multimodal fusion techniques.

The paper tackles the problem of open-vocabulary segmentation in remote sensing under adverse weather conditions like clouds and haze, achieving superior robustness and generalization across diverse cloud conditions through a multimodal Optical-SAR fusion framework.

Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
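To make the idea of text-aligned multimodal segmentation concrete, here is a minimal toy sketch, not the authors' implementation. It assumes that optical and SAR features have already been mapped into the same embedding space as the class-name text embeddings (the role the abstract assigns to cross-modal unification). The per-pixel `cloud_prob` weighting, the function names, and the 2-D embeddings are all hypothetical and stand in for the paper's dual-encoder fusion module, which the abstract does not specify in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse(optical, sar, cloud_prob):
    """Hypothetical fusion rule: a convex combination of the two modalities.

    The cloudier the pixel, the more weight goes to the SAR feature.
    This is an illustrative assumption, not the paper's fusion module.
    """
    return [(1.0 - cloud_prob) * o + cloud_prob * s for o, s in zip(optical, sar)]


def segment(optical_feats, sar_feats, cloud_probs, text_embeds):
    """Assign each pixel the class whose text embedding is most similar.

    optical_feats, sar_feats: one feature vector per pixel (flattened image).
    cloud_probs: one value in [0, 1] per pixel.
    text_embeds: dict mapping a class name to its text embedding.
    """
    names = list(text_embeds)
    labels = []
    for optical, sar, cloud_prob in zip(optical_feats, sar_feats, cloud_probs):
        fused = fuse(optical, sar, cloud_prob)
        scores = [cosine(fused, text_embeds[name]) for name in names]
        labels.append(names[scores.index(max(scores))])
    return labels


# Toy 2-D embeddings; a real system would obtain these from a text encoder.
text_embeds = {"water": [1.0, 0.0], "building": [0.0, 1.0]}

# Pixel 0 is clear sky; pixel 1 is cloud-covered, so its optical feature is unreliable.
optical_feats = [[0.9, 0.1], [0.9, 0.1]]
sar_feats = [[0.8, 0.2], [0.1, 0.9]]
cloud_probs = [0.0, 0.9]

labels = segment(optical_feats, sar_feats, cloud_probs, text_embeds)
print(labels)
```

Because the class set is just a dictionary of text embeddings, adding a new category at inference time only requires adding one more entry, which is what makes the vocabulary "open" in this toy setting.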

