Law of Vision Representation in MLLMs
This work improves computational efficiency for MLLM developers by reducing training costs, though the contribution is incremental because it builds on existing vision representation methods.
The paper tackles the problem of choosing a vision representation for multimodal large language models (MLLMs). It finds that the cross-modal Alignment and Correspondence score (AC score) of a vision representation is linearly correlated with model performance, which makes it possible to identify the optimal vision representation with a 99.7% reduction in computational cost.
We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between MLLM performance and the combination of two factors: cross-modal alignment and correspondence in vision representation. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated with model performance. By leveraging this relationship, we can identify and train only the optimal vision representation, without having to finetune the language model each time, resulting in a 99.7% reduction in computational cost.
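
To make the selection idea concrete, the following is a minimal sketch of how a linear relationship between AC score and benchmark performance could be used to rank candidate vision representations without training each one. It is not the authors' implementation: the function names `fit_ac_law` and `rank_candidates`, the candidate names, and all numbers are hypothetical placeholders, and the plain single-variable least-squares fit is an assumption made here for illustration.

```python
import numpy as np


def fit_ac_law(ac_scores, performance):
    """Least-squares fit of performance ~ slope * AC + intercept (assumed form)."""
    ac = np.asarray(ac_scores, dtype=float)
    perf = np.asarray(performance, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(ac, perf, deg=1)
    pred = slope * ac + intercept
    ss_res = np.sum((perf - pred) ** 2)
    ss_tot = np.sum((perf - perf.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot  # coefficient of determination of the fit
    return slope, intercept, r2


def rank_candidates(slope, intercept, candidate_ac):
    """Sort candidate names by predicted performance, best first."""
    preds = {name: slope * ac + intercept for name, ac in candidate_ac.items()}
    return sorted(preds, key=preds.get, reverse=True)


# Placeholder values, NOT results from the paper: a few representations that
# were actually trained, with a synthetic, exactly linear performance.
sampled_ac = [0.2, 0.4, 0.6, 0.8]
sampled_perf = [0.5 + 0.4 * a for a in sampled_ac]

slope, intercept, r2 = fit_ac_law(sampled_ac, sampled_perf)

# Hypothetical untrained candidates, for which only the AC score is computed.
candidates = {"rep_a": 0.3, "rep_b": 0.9, "rep_c": 0.5}
ranking = rank_candidates(slope, intercept, candidates)
print(ranking)
```

Under these assumptions, only the few sampled representations require full training; the remaining candidates are ranked from their AC scores alone, which is where the reduction in computational cost would come from.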