CV, LG | Aug 14, 2025

Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models

arXiv:2508.10339v1 | h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses the trade-off between learning visual concepts and learning visual skills in instruction tuning for multi-modal models. It offers a practical method for improving performance on a given benchmark, though the advance is incremental.

The paper tackles instruction selection for vision-language models. It observes that each benchmark benefits mainly from training on instructions with either similar visual concepts or similar skills, and proposes a targeted data selection method that outperforms the best existing baseline by 0.9% on average across all benchmarks and by 1.5% on the skill-focused subset.

Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into the dichotomy of mainly benefiting from training on instructions with similar skills or visual concepts. Inspired by the discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skill.
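
The final selection step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes concepts and skills have already been extracted as sets of string tags, that the concept-versus-skill decision for the benchmark is supplied as a `mode` argument, and that "most matching" is measured with Jaccard overlap. The names `jaccard` and `select_instructions` are hypothetical, as is the data layout.

```python
from typing import Dict, List, Set

# Hypothetical layout: every instruction carries pre-extracted tag sets
# under the keys "concepts" and "skills".
Tagged = Dict[str, Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two tag sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_instructions(pool: List[Tagged], benchmark: Tagged,
                        mode: str, k: int) -> List[int]:
    """Return indices of the k pool items whose tags best match the benchmark.

    mode is "concepts" or "skills"; how that choice is made for a given
    benchmark is outside this sketch and is assumed to be known.
    """
    if mode not in ("concepts", "skills"):
        raise ValueError("mode must be 'concepts' or 'skills'")
    target = benchmark[mode]
    # Stable sort, so ties keep their original pool order.
    ranked = sorted(range(len(pool)),
                    key=lambda i: jaccard(pool[i][mode], target),
                    reverse=True)
    return ranked[:k]


# Toy data, invented purely for illustration.
pool = [
    {"concepts": {"dog", "park"}, "skills": {"counting"}},
    {"concepts": {"chart"}, "skills": {"ocr", "arithmetic"}},
    {"concepts": {"dog"}, "skills": {"ocr"}},
]
benchmark = {"concepts": {"dog"}, "skills": {"ocr", "arithmetic"}}

print(select_instructions(pool, benchmark, "skills", 2))
print(select_instructions(pool, benchmark, "concepts", 2))
```

On the toy data, the same pool yields different selections depending on whether the benchmark is treated as skill-focused or concept-focused, which is the trade-off the abstract highlights.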

