CV | Mar 29, 2024

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, Conghui He

arXiv:2403.20213v430.390 citations | h-index: 14 | Has Code | AAAI

Originality: Incremental advance

AI Analysis

This work addresses the need for more detailed and honest remote sensing image analysis, which matters for applications such as environmental monitoring and urban planning. Its contribution is incremental, centered on improving dataset quality and model honesty.

The paper tackles limited and simplistic captioning in remote sensing image analysis by developing VHM, a vision language model trained on a rich-caption image-text dataset (VersaD) and an honest instruction dataset (HnstD). VHM significantly outperforms existing models on tasks such as scene classification and visual question answering, and achieves competent performance on previously unexplored tasks such as building vectorizing.

This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD), and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, different from existing remote sensing instruction datasets that only include factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several unexplored tasks, such as building vectorizing, multi-label classification and honest question answering. We will release the code, data and model weights at https://github.com/opendatalab/VHM .
