CV | Jan 24, 2025

Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing

Madeline Anderson, Miriam Cha, William T. Freeman, J. Taylor Perron, Nathaniel Maidel, Kerri Cahoy

arXiv:2501.14905v16.22 citations | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the limited adoption of vision-language models in remote sensing, which stems from data scarcity. It is an incremental advance, since it builds on existing synthetic caption generation methods.

The paper tackles the scarcity of paired image-text data in remote sensing by proposing a method that enhances vision-language datasets using maps as external data sources, which enables detailed captions, and by presenting methods to measure and mitigate hallucinations in LLM-generated text. The authors introduce the fMoW-mm dataset and demonstrate its effectiveness for automatic target recognition in few-shot settings, where it outperforms other vision-language remote sensing datasets.

Vision-language models have achieved impressive results across various fields. However, adoption in remote sensing remains limited, largely due to the scarcity of paired image-text data. To bridge this gap, synthetic caption generation has gained interest, traditionally relying on rule-based methods that use metadata or bounding boxes. While these approaches provide some description, they often lack the depth needed to capture complex wide-area scenes. Large language models (LLMs) offer a promising alternative for generating more descriptive captions, yet they can produce generic outputs and are prone to hallucination. In this paper, we propose a new method to enhance vision-language datasets for remote sensing by integrating maps as external data sources, enabling the generation of detailed, context-rich captions. Additionally, we present methods to measure and mitigate hallucinations in LLM-generated text. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations. We demonstrate its effectiveness for automatic target recognition in few-shot settings, achieving superior performance compared to other vision-language remote sensing datasets.
