AI LG May 8

Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

Jinhao Jing, Qiannian Zhao, Chao Huang, Zhan Su

arXiv:2605.07488

AI Analysis

For practitioners training large multimodal models, OST offers a computationally efficient and interpretable data selection method that outperforms LLM-as-a-Judge and heuristic scoring baselines such as DEITA, addressing the quality-quantity trade-off in synthetic data.

The paper proposes One-Step-Train (OST), a framework for efficient data selection in multimodal models that estimates each sample's marginal utility via a simulated single-step update on a lightweight proxy. On multimodal mathematical reasoning benchmarks with Qwen-series models, OST with the top-50 subset reduces training costs by 43% and surpasses LLM-as-a-Judge by 1.8 points. Under a fixed compute budget, the top-20 subset achieves a 5.6-point gain over LLM-as-a-Judge and an 8.8-point gain over Full-SFT.

The scaling of Large Multimodal Models (LMMs) is constrained by the quality-quantity trade-off inherent in synthetic data. Previous approaches, such as LLM-as-a-Judge, have proven their effectiveness in addressing this but suffer from prohibitive computational costs and lack of interpretability. To bridge this gap, we propose One-Step-Train (OST), a framework that reformulates data selection as an incremental optimization utility ranking problem. Instead of relying on semantic heuristics, OST estimates the marginal utility of each sample via a simulated single-step update on a lightweight proxy. Experiments on the Qwen series across multimodal mathematical reasoning benchmarks demonstrate that OST achieves Pareto-optimal efficiency. By selecting the top-50 subset, OST reduces training costs by 43% (and total time consumption by 17) while surpassing the strong LLM-as-a-Judge baseline by 1.8 points. Furthermore, under a fixed compute budget, our method using only the top-20 subset achieves a 5.6 point gain over LLM-as-a-Judge, improves upon heuristic scoring baselines like DEITA, and outperforms the Full-SFT baseline by 8.8 points. Notably, while Full-SFT suffers from performance degradation due to noise, our optimization-grounded approach effectively identifies toxic samples, successfully reversing the negative transfer frequently observed in complex reasoning tasks.
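The abstract describes scoring each sample by a simulated single-step update and then ranking by that score. The following is a minimal sketch of that general idea, not the authors' implementation: the linear proxy, the squared-error loss, the held-out validation set, the learning rate, and every function name here are illustrative assumptions, since the abstract does not specify them.

```python
# Toy sketch of single-step-update utility ranking.
# ASSUMPTIONS (not from the paper): a linear proxy model, squared-error loss,
# a small held-out validation set, and lr=0.05. All names are hypothetical.

def predict(w, x):
    """Linear proxy: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    """Mean squared error of the proxy on a list of (x, y) pairs."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def one_step_utility(w, sample, val_set, lr=0.05):
    """Validation-loss drop after one simulated SGD step on `sample`.

    Positive means the step helped; negative means it hurt.
    The proxy weights `w` themselves are left unchanged.
    """
    x, y = sample
    err = predict(w, x) - y
    # Gradient of (pred - y)^2 with respect to w is 2 * err * x.
    w_new = [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]
    return mse(w, val_set) - mse(w_new, val_set)

def select_top_fraction(w, pool, val_set, fraction, lr=0.05):
    """Return indices of the highest-utility samples, best first."""
    scored = [(one_step_utility(w, s, val_set, lr), i) for i, s in enumerate(pool)]
    scored.sort(key=lambda t: t[0], reverse=True)
    k = max(1, int(len(pool) * fraction))
    return [i for _, i in scored[:k]]

# Ground truth for the toy task: y = 2*x1 - 1*x2.
val_set = [((-1.0, -1.0), -1.0), ((-1.0, 1.0), -3.0),
           ((1.0, -1.0), 3.0), ((1.0, 1.0), 1.0)]

pool = [
    ((1.0, 0.0), 2.0),    # 0: clean
    ((0.0, 1.0), -1.0),   # 1: clean
    ((1.0, -1.0), 3.0),   # 2: clean
    ((1.0, 0.5), -1.5),   # 3: label sign flipped (harmful)
    ((-1.0, 1.0), 3.0),   # 4: label sign flipped (harmful)
    ((0.5, 0.5), 0.5),    # 5: clean
]

w0 = [0.0, 0.0]  # untrained proxy weights
selected = select_top_fraction(w0, pool, val_set, fraction=0.5)
print(selected)  # [2, 0, 1]
```

In this toy setup the two samples with flipped labels receive negative utility, because a step on them moves the proxy away from the validation optimum, so they fall to the bottom of the ranking. This mirrors, in a simplified form, how a utility ranking of this kind could flag harmful samples.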

