CV | AI | Dec 30, 2025

FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

arXiv:2512.24022v1 | 3 citations | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of adapting VLMs to remote sensing for domain-specific applications. It is an incremental improvement with some novel method elements.

The paper tackles the challenge of applying vision-language models to remote sensing by introducing MF-RSVLM, which fuses multi-scale visual features to improve fine-grained understanding and reduce visual forgetting, achieving state-of-the-art or competitive results on classification, captioning, and VQA tasks.

Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision-Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at https://github.com/Yunkaidang/RSVLM.
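
The two ideas named in the abstract, multi-scale feature fusion and recurrent visual feature injection, can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). Every function name, the average-pooling choice, the scales, the gate value, and the injection interval below are assumptions made purely for illustration, and the "layer" is a placeholder rather than a real transformer block.

```python
def avg_pool(grid, k):
    """Average-pool a square grid of feature vectors with a k x k window, stride k.

    Assumes the grid side length is divisible by k.
    """
    n, d = len(grid), len(grid[0][0])
    out = []
    for i in range(0, n, k):
        row = []
        for j in range(0, n, k):
            cells = [grid[a][b] for a in range(i, i + k) for b in range(j, j + k)]
            row.append([sum(c[t] for c in cells) / len(cells) for t in range(d)])
        out.append(row)
    return out


def multi_scale_tokens(grid, scales=(1, 2, 4)):
    """Concatenate tokens from fine (local detail) to coarse (global context) scales."""
    tokens = []
    for k in scales:
        for row in avg_pool(grid, k):
            tokens.extend(row)
    return tokens


def inject(hidden, visual_tokens, gate=0.5):
    """Hypothetical injection: add the gated mean visual token to every hidden state."""
    d = len(hidden[0])
    mean_v = [sum(v[t] for v in visual_tokens) / len(visual_tokens) for t in range(d)]
    return [[h[t] + gate * mean_v[t] for t in range(d)] for h in hidden]


def run_layers(hidden, visual_tokens, num_layers=6, every=2):
    """Stand-in for a language model stack that re-injects visual features periodically."""
    injections = 0
    for layer in range(num_layers):
        # Placeholder for a transformer layer: the signal simply decays.
        hidden = [[0.9 * x for x in h] for h in hidden]
        if (layer + 1) % every == 0:
            hidden = inject(hidden, visual_tokens)
            injections += 1
    return hidden, injections
```

In this toy setup, a 4 x 4 patch grid yields 16 + 4 + 1 = 21 visual tokens across the three scales, and the periodic injection keeps the hidden states from decaying toward zero, which loosely mirrors the stated goal of reducing visual forgetting in deeper layers.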
