CVJul 18, 2024Code
EarthMarker: A Visual Prompting Multi-modal Large Language Model for Remote SensingWei Zhang, Miaoxin Cai, Tong Zhang et al.
Recent advances in prompt learning have allowed users to interact with artificial intelligence (AI) tools in multi-turn dialogue, enabling an interactive understanding of images. However, it is difficult and inefficient to deliver information in complicated remote sensing (RS) scenarios using plain language instructions alone, which would severely hinder deep comprehension of the latent content in imagery. Besides, existing prompting strategies in natural scenes are hard to apply to interpret the RS data due to significant domain differences. To address these challenges, the first visual prompting-based multi-modal large language model (MLLM) named EarthMarker is proposed in the RS domain. EarthMarker is capable of interpreting RS imagery at the image, region, and point levels by levering visual prompts (i.e., boxes and points). Specifically, a shared visual encoding method is developed to establish the spatial pattern interpretation relationships between the multi-scale representations of input images and various visual prompts. Subsequently, the mixed visual-spatial representations are associated with language instructions to construct joint prompts, enabling the interpretation of intricate content of RS imagery. Furthermore, to bridge the domain gap between natural and RS data, and effectively transfer domain-level knowledge from natural scenes to the RS domain, a cross-domain learning strategy is developed to facilitate the RS imagery understanding. In addition, to tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal multi-granularity visual prompts instruction-following is constructed. Our code and dataset are available at https://github.com/wivizhang/EarthMarker.
CVJul 8, 2022
Consecutive Pretraining: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing DomainTong Zhang, Peng Gao, Hao Dong et al.
Currently, under supervised learning, a model pretrained by a large-scale nature scene dataset and then fine-tuned on a few specific task labeling data is the paradigm that has dominated the knowledge transfer learning. It has reached the status of consensus solution for task-aware model training in remote sensing domain (RSD). Unfortunately, due to different categories of imaging data and stiff challenges of data annotation, there is not a large enough and uniform remote sensing dataset to support large-scale pretraining in RSD. Moreover, pretraining models on large-scale nature scene datasets by supervised learning and then directly fine-tuning on diverse downstream tasks seems to be a crude method, which is easily affected by inevitable labeling noise, severe domain gaps and task-aware discrepancies. Thus, in this paper, considering the self-supervised pretraining and powerful vision transformer (ViT) architecture, a concise and effective knowledge transfer learning strategy called ConSecutive PreTraining (CSPT) is proposed based on the idea of not stopping pretraining in natural language processing (NLP), which can gradually bridge the domain gap and transfer knowledge from the nature scene domain to the RSD. The proposed CSPT also can release the huge potential of unlabeled data for task-aware model training. Finally, extensive experiments are carried out on twelve datasets in RSD involving three types of downstream tasks (e.g., scene classification, object detection and land cover classification) and two types of imaging data (e.g., optical and SAR). The results show that by utilizing the proposed CSPT for task-aware model training, almost all downstream tasks in RSD can outperform the previous method of supervised pretraining-then-fine-tuning and even surpass the state-of-the-art (SOTA) performance without any expensive labeling consumption and careful model design.
CVJan 30, 2024
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing DomainWei Zhang, Miaoxin Cai, Tong Zhang et al.
Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain. Owing to the significant diversities between the natural and remote sensing (RS) images, the development of MLLMs in the RS domain is still in the infant stage. To fill the gap, a pioneer MLLM named EarthGPT integrating various multi-sensor RS interpretation tasks uniformly is proposed in this paper for universal RS image comprehension. In EarthGPT, three key techniques are developed including a visual-enhanced perception mechanism, a cross-modal mutual comprehension approach, and a unified instruction tuning method for multi-sensor multi-task in the RS domain. More importantly, a dataset named MMRS-1M featuring large-scale multi-sensor multi-modal RS instruction-following is constructed, comprising over 1M image-text pairs based on 34 existing diverse RS datasets and including multi-sensor images such as optical, synthetic aperture radar (SAR), and infrared. The MMRS-1M dataset addresses the drawback of MLLMs on RS expert knowledge and stimulates the development of MLLMs in the RS domain. Extensive experiments are conducted, demonstrating the EarthGPT's superior performance in various RS visual interpretation tasks compared with the other specialist models and MLLMs, proving the effectiveness of the proposed EarthGPT and offering a versatile paradigm for open-set reasoning tasks.
CVJan 22, 2022Code
Linear Array Network for Low-light Image EnhancementKeqi Wang, Ziteng Cui, Jieru Jia et al.
Convolution neural networks (CNNs) based methods have dominated the low-light image enhancement tasks due to their outstanding performance. However, the convolution operation is based on a local sliding window mechanism, which is difficult to construct the long-range dependencies of the feature maps. Meanwhile, the self-attention based global relationship aggregation methods have been widely used in computer vision, but these methods are difficult to handle high-resolution images because of the high computational complexity. To solve this problem, this paper proposes a Linear Array Self-attention (LASA) mechanism, which uses only two 2-D feature encodings to construct 3-D global weights and then refines feature maps generated by convolution layers. Based on LASA, Linear Array Network (LAN) is proposed, which is superior to the existing state-of-the-art (SOTA) methods in both RGB and RAW based low-light enhancement tasks with a smaller amount of parameters. The code is released in https://github.com/cuiziteng/LASA_enhancement.
CVMar 6, 2024
Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing ImageryWei Zhang, Miaoxin Cai, Tong Zhang et al.
Ship detection needs to identify ship locations from remote sensing (RS) scenes. Due to different imaging payloads, various appearances of ships, and complicated background interference from the bird's eye view, it is difficult to set up a unified paradigm for achieving multi-source ship detection. To address this challenge, in this article, leveraging the large language models (LLMs)'s powerful generalization ability, a unified visual-language model called Popeye is proposed for multi-source ship detection from RS imagery. Specifically, to bridge the interpretation gap between the multi-source images for ship detection, a novel unified labeling paradigm is designed to integrate different visual modalities and the various ship detection ways, i.e., horizontal bounding box (HBB) and oriented bounding box (OBB). Subsequently, the hybrid experts encoder is designed to refine multi-scale visual features, thereby enhancing visual perception. Then, a visual-language alignment method is developed for Popeye to enhance interactive comprehension ability between visual and language content. Furthermore, an instruction adaption mechanism is proposed for transferring the pre-trained visual-language knowledge from the nature scene into the RS domain for multi-source ship detection. In addition, the segment anything model (SAM) is also seamlessly integrated into the proposed Popeye to achieve pixel-level ship segmentation without additional training costs. Finally, extensive experiments are conducted on the newly constructed ship instruction dataset named MMShip, and the results indicate that the proposed Popeye outperforms current specialist, open-vocabulary, and other visual-language models for zero-shot multi-source ship detection.
CVJun 14, 2025
UniDet-D: A Unified Dynamic Spectral Attention Model for Object Detection under Adverse WeathersWei Zhang, Yuantao Wang, Haowei Yang et al.
Real-world object detection is a challenging task where the captured images/videos often suffer from complex degradations due to various adverse weather conditions such as rain, fog, snow, low-light, etc. Despite extensive prior efforts, most existing methods are designed for one specific type of adverse weather with constraints of poor generalization, under-utilization of visual features while handling various image degradations. Leveraging a theoretical analysis on how critical visual details are lost in adverse-weather images, we design UniDet-D, a unified framework that tackles the challenge of object detection under various adverse weather conditions, and achieves object detection and image restoration within a single network. Specifically, the proposed UniDet-D incorporates a dynamic spectral attention mechanism that adaptively emphasizes informative spectral components while suppressing irrelevant ones, enabling more robust and discriminative feature representation across various degradation types. Extensive experiments show that UniDet-D achieves superior detection accuracy across different types of adverse-weather degradation. Furthermore, UniDet-D demonstrates superior generalization towards unseen adverse weather conditions such as sandstorms and rain-fog mixtures, highlighting its great potential for real-world deployment.
CVApr 17, 2025
EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual PromptingWei Zhang, Miaoxin Cai, Yaqian Ning et al.
Recent advances in natural-domain multi-modal large language models (MLLMs) have demonstrated effective spatial reasoning through visual and textual prompting. However, their direct transfer to remote sensing (RS) is hindered by heterogeneous sensing physics, diverse modalities, and unique spatial scales. Existing RS MLLMs are mainly limited to optical imagery and plain language interaction, preventing flexible and scalable real-world applications. In this article, EarthGPT-X is proposed, the first flexible spatial MLLM that unifies multi-source RS imagery comprehension and accomplishes both coarse-grained and fine-grained visual tasks under diverse visual prompts in a single framework. Distinct from prior models, EarthGPT-X introduces: 1) a dual-prompt mechanism combining text instructions with various visual prompts (i.e., point, box, and free-form) to mimic the versatility of referring in human life; 2) a comprehensive multi-source multi-level prompting dataset, the model advances beyond holistic image understanding to support hierarchical spatial reasoning, including scene-level understanding and fine-grained object attributes and relational analysis; 3) a cross-domain one-stage fusion training strategy, enabling efficient and consistent alignment across modalities and tasks. Extensive experiments demonstrate that EarthGPT-X substantially outperforms prior nature and RS MLLMs, establishing the first framework capable of multi-source, multi-task, and multi-level interpretation using visual prompting in RS scenarios.