CV · Mar 2, 2025

Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

arXiv:2503.00743v2 · 13 citations · h-index: 10 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific bottleneck in remote sensing AI by providing a systematic framework for data curation, though it is incremental as it builds on existing vision-language models.

The paper tackles the problem of lacking high-quality training data for remote sensing vision-language models by proposing a learned scoring model for automated quality assessment of synthetically generated data, and shows that fine-tuning with top-ranked data improves accuracy over full-data fine-tuning and CLIP-score-based methods.

Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic understanding. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks. Our code, model, and dataset are publicly available.
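The two uses of a quality score mentioned in the abstract, keeping the top-ranked fraction of training pairs and picking the best of N candidate outputs, can be sketched roughly as below. This is only an illustration under stated assumptions: the function names `select_top_fraction` and `best_of_n` are hypothetical, the scores are plain floats supplied by the caller, and nothing here reproduces the paper's actual score model or released code.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    items: Sequence[T], scores: Sequence[float], fraction: float = 0.3
) -> List[T]:
    """Keep the highest-scoring `fraction` of items (hypothetical helper).

    `scores[i]` is assumed to be a quality score for `items[i]`, with
    higher meaning better. How the scores are produced is out of scope.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not items:
        return []
    # Keep at least one item so a tiny dataset does not collapse to nothing.
    keep = max(1, round(len(items) * fraction))
    # Sort indices by score, highest first, then take the first `keep`.
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in ranked[:keep]]


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest score (best-of-N selection).

    `score_fn` stands in for any scorer; it is an assumption of this sketch.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)
```

For instance, with ten image-text pairs and a fraction of 0.3, `select_top_fraction` returns the three pairs with the highest scores; `best_of_n` would be called once per query with the N responses sampled from a model.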

