CV | AI | LG | Dec 20, 2023

SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

arXiv:2312.12856v1 | 176 citations | h-index: 6 | AAAI
Originality: Highly original
AI Analysis

This paper addresses the shortage of data for developing versatile vision-language models in remote sensing, a field whose applications include tackling climate change and supporting sustainable development. It is a foundational contribution rather than an incremental one.

The authors tackled the lack of a large-scale, semantically diverse vision-language dataset for remote sensing images by constructing SkyScript, which comprises 2.6 million image-text pairs. A model obtained by continual pre-training on this dataset achieved a 6.2% average accuracy gain over baseline models in zero-shot scene classification across seven benchmark datasets.

Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis.
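The core idea in the abstract, using geo-coordinates to attach OpenStreetMap semantics to unlabeled images, can be illustrated with a small sketch. This is not the authors' pipeline. The `BBox` and `TaggedObject` types, the `tags_to_caption` and `pair_image_with_text` helpers, the caption format, and the sample coordinates are all hypothetical and chosen only for illustration. A real pipeline would also need to handle polygons, image resolution, and tag filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Geographic footprint of one image tile (hypothetical type), in degrees."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        return self.west <= lon <= self.east and self.south <= lat <= self.north


@dataclass(frozen=True)
class TaggedObject:
    """A map object reduced to a point with OSM-style key/value tags (assumption)."""
    lon: float
    lat: float
    tags: dict


def tags_to_caption(tags: dict) -> str:
    """Turn key/value tags into a plain-text description (illustrative format)."""
    parts = [
        f"{key.replace('_', ' ')}: {value.replace('_', ' ')}"
        for key, value in sorted(tags.items())
    ]
    return ", ".join(parts)


def pair_image_with_text(image_bbox: BBox, objects: list) -> list:
    """Return one caption per tagged object that falls inside the image footprint."""
    return [
        tags_to_caption(obj.tags)
        for obj in objects
        if image_bbox.contains(obj.lon, obj.lat)
    ]


# Made-up sample data: one object inside the tile, one outside it.
tile = BBox(west=-122.18, south=37.42, east=-122.16, north=37.44)
objects = [
    TaggedObject(-122.17, 37.43, {"leisure": "golf_course"}),
    TaggedObject(-122.30, 37.50, {"power": "plant", "plant_source": "solar"}),
]

captions = pair_image_with_text(tile, objects)
print(captions)
```

Only the first object lies inside the tile, so only its tags become text for that image. The second object would be paired with whichever tile covers its coordinates.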

Code Implementations: 1 repo
