CV · Mar 2, 2025

Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li, Feng Gu, Yanglangxing He, Pengfeng Xiao, Xueliang Zhang

arXiv:2503.00743v219.013 citations · h-index: 10 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a domain-specific bottleneck in remote sensing AI by providing a systematic framework for data curation, though it is incremental as it builds on existing vision-language models.

The paper tackles the shortage of high-quality training data for remote sensing vision-language models by proposing a learned scoring model for automated quality assessment of synthetically generated data. It shows that fine-tuning on the top-ranked data improves accuracy over both full-data fine-tuning and CLIP-score-based ranking.

Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic understanding. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks. Our code, model, and dataset are publicly available.
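The two uses of a quality score mentioned in the abstract, keeping the top 30% of ranked data and picking the best of N candidates, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_top_fraction` and `best_of_n` are hypothetical, and the scores are plain floats standing in for the output of a learned score model.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    items: Sequence[T], scores: Sequence[float], fraction: float = 0.3
) -> List[T]:
    """Keep the highest-scoring `fraction` of items (hypothetical helper).

    `scores` is assumed to hold one quality score per item, e.g. produced
    by a learned score model; here they are just floats.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not items:
        return []
    # Keep at least one item so a small dataset never yields an empty set.
    keep = max(1, int(len(items) * fraction))
    # Sort indices by descending score; ties keep their original order.
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:keep]]


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest score (best-of-N selection).

    `score_fn` is an assumed stand-in for a call to a score model.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    pairs = [f"pair_{i}" for i in range(10)]
    scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6, 0.5, 0.0]
    print(select_top_fraction(pairs, scores, fraction=0.3))
    # Toy scorer: string length stands in for a model-assigned quality score.
    print(best_of_n(["a", "bbb", "cc"], score_fn=len))
```

In practice the scores would come from running a scoring model over each image-text pair or each candidate response; the selection logic stays the same regardless of how the scores are produced.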
