CV · AI · Sep 27, 2025

Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

arXiv:2510.00040v1 · 1 citation · h-index: 2
Originality: Highly original
AI Analysis

This paper addresses inefficient, poorly controlled instruction tuning for vision-language models and proposes a new paradigm that could broadly affect how such models are trained.

The paper tackles the difficulty of controlling vision-language models through instruction tuning by introducing Capability-Attributed Data Curation (CADC), which curates training data through an unsupervised analysis of the model's intrinsic capabilities. Using only 5% of the original data, CADC outperforms full-data training on multimodal benchmarks.

Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Shrinking the instruction-tuning data budget often causes regressions, because heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.
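The abstract names three stages (capability discovery, influence-based attribution, and curriculum curation) but does not specify the algorithms behind them. The Python sketch below is a hypothetical illustration of how such stages could fit together, not the authors' method. It assumes each training example comes with a gradient feature vector per training checkpoint (for instance, a low-dimensional projection). It uses plain k-means with farthest-first initialization as a stand-in for unsupervised capability discovery, scores attribution with a TracIn-style sum of gradient dot products against each capability centroid, splits the budget equally across capabilities, and orders stages by mean attribution score. All function names (`discover_capabilities`, `attribute`, `curate`) and the ordering heuristic are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flatten(trajectory: List[Vector]) -> Vector:
    # Concatenate the per-checkpoint gradient features g_i^(1), ..., g_i^(T).
    return [x for grad in trajectory for x in grad]


def discover_capabilities(points: List[Vector], k: int, iters: int = 20) -> List[Vector]:
    """Stand-in for unsupervised capability discovery: plain k-means on trajectories."""
    # Farthest-first initialization keeps the sketch deterministic.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def attribute(points: List[Vector], centroids: List[Vector]) -> Tuple[List[int], List[float]]:
    """TracIn-style attribution: score = sum over checkpoints of <g_i^t, c^t>.

    On flattened trajectories that sum is a single dot product (equal checkpoint
    weights are assumed). Each example goes to its highest-scoring capability.
    """
    labels, scores = [], []
    for p in points:
        s = [dot(p, c) for c in centroids]
        best = max(range(len(centroids)), key=lambda c: s[c])
        labels.append(best)
        scores.append(s[best])
    return labels, scores


def curate(trajectories: List[List[Vector]], k: int, budget: int) -> List[List[int]]:
    """Return k stages of example indices: balanced selection, then staged ordering."""
    points = [flatten(t) for t in trajectories]
    labels, scores = attribute(points, discover_capabilities(points, k))
    quota = budget // k  # equal share per capability; any remainder is left unused
    stages = []
    for c in range(k):
        members = sorted((i for i, lab in enumerate(labels) if lab == c),
                         key=lambda i: scores[i], reverse=True)
        stages.append(members[:quota])

    def mean_score(stage: List[int]) -> float:
        return sum(scores[i] for i in stage) / len(stage) if stage else float("-inf")

    # Assumed sequencing heuristic: capabilities with higher mean attribution go first.
    return sorted(stages, key=mean_score, reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy(v: Vector) -> Vector:
        return [x + rng.uniform(-0.1, 0.1) for x in v]

    # Two synthetic "capabilities": gradients along axis 0 or axis 1 at both checkpoints.
    data = [[noisy([1.0, 0.0]), noisy([1.0, 0.0])] for _ in range(10)]
    data += [[noisy([0.0, 1.0]), noisy([0.0, 1.0])] for _ in range(10)]
    for n, stage in enumerate(curate(data, k=2, budget=6)):
        print(f"stage {n}: examples {stage}")
```

Equal per-capability quotas are the simplest reading of "balanced selection"; a real system might instead weight quotas by capability size or difficulty, and could replace k-means with any clustering method suited to high-dimensional gradient features.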
