Scaling Capability in Token Space: An Analysis of Large Vision Language Model
This work provides a theoretical framework for understanding scaling in vision-language models. The contribution is incremental: it extends known scaling concepts from language models to multimodal contexts.
The study investigated whether vision-language models exhibit predictable scaling behaviors with respect to the number of vision tokens, similar to large language models, and found that model performance aligns with a theoretical scaling relationship characterized by sublinear and linear regimes.
Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exists for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize the relationship between the number of vision tokens and the expected divergence of distance between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for fewer vision tokens and linear scaling for more vision tokens. This aligns with model performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where \(n\) is the number of vision tokens and the scaling exponent \(\alpha(n)\) relates to the correlation structure between vision token representations. Empirical validation across multiple vision-language benchmarks shows that model performance matches the predictions of the scaling relationship. The findings contribute to understanding vision token scaling in transformers through a theoretical framework that complements empirical observations.
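
To make the form \(S(n) \approx c / n^{\alpha(n)}\) concrete, the following is a minimal sketch of how the exponent could be estimated from measurements taken at several vision token counts. It is an illustration under stated assumptions and not the paper's procedure: the function names `fit_power_law` and `local_exponents`, the token counts, and the synthetic values are all hypothetical, and the global fit treats \(\alpha\) as constant even though the abstract allows it to vary with \(n\).

```python
import math

def fit_power_law(ns, values):
    """Least-squares fit of log S = log c - alpha * log n.

    Assumes a constant exponent alpha (a simplification of alpha(n)).
    Returns the pair (c, alpha).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope

def local_exponents(ns, values):
    """Estimate alpha between consecutive token counts.

    A change in these values along n would indicate a change of regime.
    """
    pairs = list(zip(ns, values))
    return [
        -(math.log(s2) - math.log(s1)) / (math.log(n2) - math.log(n1))
        for (n1, s1), (n2, s2) in zip(pairs, pairs[1:])
    ]

# Hypothetical token counts and synthetic values generated with c = 2, alpha = 0.5.
token_counts = [16, 32, 64, 128, 256, 576]
synthetic = [2.0 / n ** 0.5 for n in token_counts]

c_hat, alpha_hat = fit_power_law(token_counts, synthetic)
alphas = local_exponents(token_counts, synthetic)
print(f"c = {c_hat:.3f}, alpha = {alpha_hat:.3f}")
```

On real benchmark measurements, the local exponents would be the more informative output, since a drift in their value across token counts is what would distinguish one scaling regime from another.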