CVMay 18

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu, Jiasen Hu, Lang Sun, Weipeng Zhang, Jiaqi Liu, Xu Na, Haoran Liu, Weijie Zhang

arXiv:2605.1794983.4

Predicted impact top 23% in CV · last 90 daysOriginality Highly original

AI Analysis

For remote sensing vision-language models, this work addresses the problem of over-reliance on language priors in fine-grained spatial reasoning, offering a more reliable approach.

SkyNative introduces an encoder-free multimodal framework for remote sensing that directly represents images as raw patch tokens, avoiding premature compression of visual evidence. It achieves stronger image-grounded perception and improved robustness against language priors in fine-grained spatial reasoning tasks.

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

View on arXiv PDF

Similar