CV · Aug 29, 2024

Law of Vision Representation in MLLMs

arXiv:2408.16357v3 · 17 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This work lowers training costs for MLLM developers. The advance is incremental because it builds on existing vision representation methods.

The paper addresses how to choose the vision representation in multimodal large language models (MLLMs). It finds that a score combining cross-modal alignment and correspondence (the AC score) is linearly correlated with model performance, which allows the optimal vision representation to be identified with a 99.7% reduction in computational cost.

We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
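As a rough illustration of how a linear relationship between AC score and performance could be used for selection, the sketch below fits a line on a few vision representations whose benchmark performance is already known, then ranks the remaining candidates by predicted performance. It assumes each representation is summarized by a single scalar AC score; the function names and all numbers are hypothetical placeholders, not values or code from the paper, and how the AC score itself is computed is not covered here.

```python
import numpy as np


def fit_line(ac_scores, performance):
    """Least-squares fit of: performance ~= slope * ac_score + intercept."""
    slope, intercept = np.polyfit(ac_scores, performance, deg=1)
    return float(slope), float(intercept)


def rank_candidates(ac_scores, slope, intercept):
    """Predict performance from AC scores and return indices sorted best-first."""
    predicted = slope * np.asarray(ac_scores, dtype=float) + intercept
    order = np.argsort(-predicted)  # descending predicted performance
    return order, predicted


# Hypothetical placeholder data: representations that were fully trained
# and evaluated, so both AC score and benchmark performance are known.
sampled_ac = np.array([0.2, 0.4, 0.6, 0.8])
sampled_perf = np.array([50.0, 55.0, 60.0, 65.0])

slope, intercept = fit_line(sampled_ac, sampled_perf)

# Hypothetical candidates for which only the AC score has been computed.
candidate_ac = np.array([0.3, 0.9, 0.5])
order, predicted = rank_candidates(candidate_ac, slope, intercept)
best_index = int(order[0])

print(f"fit: performance ~= {slope:.2f} * AC + {intercept:.2f}")
print(f"best candidate index: {best_index}")
```

Under this assumption, only the top-ranked candidate would need full training, which is the kind of saving the abstract describes.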

Code Implementations: 1 repo
