CV AI LG | Dec 20, 2023

SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, Ram Rajagopal

arXiv:2312.12856v1 | 31.7 | 198 citations | h-index: 6 | Has Code | AAAI

Originality: Highly original

AI Analysis

This paper addresses the shortage of data for developing versatile vision-language models in remote sensing, a gap that matters for applications related to climate change and sustainable development. The contribution is foundational rather than incremental.

The authors tackle the lack of a large-scale, semantically diverse vision-language dataset for remote sensing images by constructing SkyScript, which comprises 2.6 million image-text pairs. A model continually pre-trained on SkyScript achieves a 6.2% average accuracy gain over baseline models in zero-shot scene classification across seven benchmarks.

Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis.
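
The pairing idea described in the abstract, using geo-coordinates to connect unlabeled images with OpenStreetMap semantics, can be illustrated with a minimal sketch. This is not the authors' pipeline, whose details the abstract does not give: the `ImageTile` and `OsmObject` classes, the point-in-bounding-box test, and the caption template are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class OsmObject:
    """Hypothetical, simplified OSM-like object: one point plus key-value tags."""
    lat: float
    lon: float
    tags: dict


@dataclass
class ImageTile:
    """Hypothetical image tile, described only by its geographic bounding box."""
    tile_id: str
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        # True if the point lies inside (or on the edge of) the bounding box.
        return self.south <= lat <= self.north and self.west <= lon <= self.east


def tags_to_caption(tags):
    # Illustrative template only: join "key value" phrases in sorted key order.
    return ", ".join(f"{k} {v}".replace("_", " ") for k, v in sorted(tags.items()))


def pair_tiles_with_text(tiles, objects):
    """Pair each tile with a caption for every tagged object located inside it."""
    pairs = []
    for tile in tiles:
        for obj in objects:
            if obj.tags and tile.contains(obj.lat, obj.lon):
                pairs.append((tile.tile_id, tags_to_caption(obj.tags)))
    return pairs
```

Calling `pair_tiles_with_text(tiles, objects)` on a list of tiles and tagged objects returns `(tile_id, caption)` tuples. The nested loop is quadratic and is only meant to show the matching logic; a dataset at the scale reported here would presumably need a spatial index.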
