AI · Mar 10

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

arXiv:2603.09715v134.5h-index: 4
Predicted impact: top 18% in AI · last 90 days · Originality: Highly original
AI Analysis

This work addresses the challenge of inefficient data selection for vision-language large models, offering a more effective and cost-efficient approach for researchers and practitioners in multimodal AI.

The paper tackles the selection of high-quality samples for vision-language instruction tuning. It proposes CVS, a training-free method that identifies samples requiring genuine cross-modal reasoning. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% while using only 10% and 15% of the data, respectively, and it reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
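The selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact scoring formula, so the signed difference used here, the rule of dropping non-positive shifts as "semantic-conflict" samples, and every name (`Sample`, `validity_shift`, `select_top_fraction`, `toy_scorer`) are assumptions made for illustration. A lookup table of made-up scores stands in for the frozen VLLM evaluator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    image_id: str
    question: str
    answer: str


# Hypothetical evaluator interface: returns a validity score in [0, 1] for
# `answer` given the image, optionally conditioned on the question.
# In practice this would wrap a frozen VLLM; here it is only a placeholder.
ValidityScorer = Callable[[str, Optional[str], str], float]


def validity_shift(sample: Sample, scorer: ValidityScorer) -> float:
    """Change in answer validity caused by adding the question (assumed signed)."""
    without_question = scorer(sample.image_id, None, sample.answer)
    with_question = scorer(sample.image_id, sample.question, sample.answer)
    return with_question - without_question


def select_top_fraction(
    samples: List[Sample], scorer: ValidityScorer, fraction: float
) -> List[Sample]:
    """Keep the `fraction` of samples with the largest positive validity shift."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    scored = [(validity_shift(s, scorer), s) for s in samples]
    # Assumption: a non-positive shift means the question adds nothing or
    # conflicts with the answer, so such samples are filtered out.
    kept = [(shift, s) for shift, s in scored if shift > 0]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    budget = max(1, int(fraction * len(samples)))
    return [s for _, s in kept[:budget]]


# Toy scores standing in for a real evaluator (illustrative values only).
TOY_SCORES: Dict[Tuple[str, Optional[str], str], float] = {
    ("img1", None, "a red car"): 0.9,
    ("img1", "What is shown?", "a red car"): 0.95,
    ("img2", None, "three"): 0.2,
    ("img2", "How many birds are there?", "three"): 0.9,
    ("img3", None, "a dog"): 0.8,
    ("img3", "What colour is the sky?", "a dog"): 0.1,
    ("img4", None, "left"): 0.3,
    ("img4", "Which side is the cup on?", "left"): 0.7,
}


def toy_scorer(image_id: str, question: Optional[str], answer: str) -> float:
    return TOY_SCORES[(image_id, question, answer)]


SAMPLES = [
    Sample("img1", "What is shown?", "a red car"),
    Sample("img2", "How many birds are there?", "three"),
    Sample("img3", "What colour is the sky?", "a dog"),
    Sample("img4", "Which side is the cup on?", "left"),
]

selected = select_top_fraction(SAMPLES, toy_scorer, fraction=0.5)
```

In this toy data, `img1` has an answer that is already plausible from the image alone, so the question barely changes its score, while `img3` pairs a question with an unrelated answer and receives a negative shift. With a 50% budget the sketch keeps `img2` and `img4`, the two samples where the question matters most.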

