CV | AI | May 11

MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

arXiv:2605.10769 | 37.2
AI Analysis

For remote sensing segmentation tasks, MPerS improves multimodal fusion by generating high-quality captions and adaptively integrating textual semantics, but the approach is incremental as it builds on existing MLLMs and fusion techniques.

MPerS targets a gap in remote sensing scene segmentation: prior work has largely neglected both the generation of high-quality captions and the question of how effective those captions are in multimodal fusion. It uses multiple MLLMs (LLaVA, ChatGPT, Qwen) to generate diverse captions and a Dynamic MixExperts module to integrate them, and it reports superior performance on three public RS datasets.
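To make the idea of adaptively weighting captions from several MLLMs concrete, here is a minimal sketch of a softmax gate over per-model caption embeddings. It is not the paper's Dynamic MixExperts module, whose internals are not given here. The function name `mix_caption_experts`, the dot-product scoring, the `temperature` parameter, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_caption_experts(visual_summary, caption_embeddings, temperature=1.0):
    """Hypothetical gate: weight each MLLM's caption embedding by its
    dot-product similarity to a visual summary vector, then blend them.

    This scoring rule is an assumption, not the method from the paper.
    """
    names = list(caption_embeddings)
    scores = [
        sum(v * t for v, t in zip(visual_summary, caption_embeddings[name]))
        / temperature
        for name in names
    ]
    weights = softmax(scores)
    dim = len(visual_summary)
    # Weighted sum of the caption embeddings, one dimension at a time.
    fused = [
        sum(w * caption_embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy embeddings (made up): one caption vector per MLLM.
visual_summary = [1.0, 0.0, 0.0]
caption_embeddings = {
    "llava": [0.9, 0.1, 0.0],
    "chatgpt": [0.1, 0.9, 0.0],
    "qwen": [0.5, 0.5, 0.0],
}
weights, fused = mix_caption_experts(visual_summary, caption_embeddings)
print(weights)
```

In this toy setup the caption whose embedding aligns best with the visual summary receives the largest weight. A learned gate would replace the fixed dot product with trainable parameters.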

The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.
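As a rough illustration of text guiding visual features, the sketch below uses a single text vector as the query in scaled dot-product attention over a handful of patch features, then reweights each patch by its attention score. It shows generic cross-attention only and does not reproduce the paper's Linguistic Query Guided Attention. The function name `text_guided_attention`, the reweighting step, and the toy values are assumptions.

```python
import math


def text_guided_attention(text_query, patch_features):
    """Hypothetical text-as-query attention over visual patches.

    Scores each patch by scaled dot product with the text query,
    normalizes with softmax, and scales each patch feature by its weight.
    This is an assumed simplification, not the module from the paper.
    """
    scale = math.sqrt(len(text_query))
    scores = [
        sum(q * k for q, k in zip(text_query, patch)) / scale
        for patch in patch_features
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # Emphasize patches that match the text; suppress the rest.
    guided = [
        [a * value for value in patch]
        for a, patch in zip(attention, patch_features)
    ]
    return attention, guided


# Toy inputs (made up): a 2-D text query and three 2-D patch features.
text_query = [1.0, 0.0]
patch_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
attention, guided = text_guided_attention(text_query, patch_features)
print(attention)
```

A real model would apply learned projections to the query, keys, and values and would typically use several heads. The sketch keeps only the scoring and reweighting steps.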

