CVMay 18
SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence ReasoningXiao Yang, Ronghao Fu, Zhiwen Lin et al.
Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.
CVMar 10
GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency LearningXiao Yang, Ronghao Fu, Zhuoran Duan et al.
Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model's ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
CVMar 10
GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process SupervisionLang Sun, Ronghao Fu, Zhuoran Duan et al.
While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
CVDec 2, 2025
SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of ExpertsJiaqi Liu, Ronghao Fu, Lang Sun et al.
The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.
NIMay 29, 2025
Agile Orchestration at Will: An Entire Smart Service-Based Security Architecture Towards 6GZhuoran Duan, Guoshun Nan, Rushan Li et al.
The upcoming 6G will fundamentally reshape mobile networks beyond communications, unlocking a multitude of applications that were once considered unimaginable. Meanwhile, security and resilience are especially highlighted in the 6G design principles. However, safeguarding 6G networks will be quite challenging due to various known and unknown threats from highly heterogeneous networks and diversified security requirements of distinct use cases, calling for a comprehensive re-design of security architecture. This motivates us to propose ES3A (Entire Smart Service-based Security Architecture), a novel security architecture for 6G networks. Specifically, we first discuss six high-level principles of our ES3A that include hierarchy, flexibility, scalability, resilience, endogeny, and trust and privacy. With these goals in mind, we then introduce three guidelines from a deployment perspective, envisioning our ES3A that offers service-based security, end-to-end protection, and smart security automation for 6G networks. Our architecture consists of three layers and three domains. It relies on a two-stage orchestration mechanism to tailor smart security strategies for customized protection in high-dynamic 6G networks, thereby addressing the aforementioned challenges. Finally, we prototype the proposed ES3A on a real-world radio system based on Software-Defined Radio (SDR). Experiments show the effectiveness of our ES3A. We also provide a case to show the superiority of our architecture.