Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
This work addresses inference efficiency for VLMs, which is crucial for real-world deployment. The contribution is incremental, however: it builds on existing token compression methods and optimizes the trade-off between visual token count and LLM size rather than introducing a fundamentally new paradigm.
The paper tackles high inference latency in Vision Language Models (VLMs) by establishing scaling laws that characterize the optimal trade-off between visual token count and LLM parameter count under a fixed inference compute budget. For visual reasoning tasks, the laws indicate that inference-optimal VLMs should use the largest LLM that fits the budget with as few visual tokens as possible (often just one). This regime requires token compression ratios well beyond the 5-10× reductions typical of prior work, and the authors take first steps toward prompt-based compression algorithms tailored to it.
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks, driven by incorporating image representations into the token inputs of Large Language Models (LLMs). However, their real-world deployment is often constrained by high latency during inference due to the substantial compute required by the LLM to process the large number of input tokens, predominantly arising from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input tokens needed to represent the image, the latter of which has been the focus of many recent efforts around token compression. However, it is unclear what the optimal trade-off is given a fixed inference budget. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs is achieved by using the largest LLM that fits within the inference budget while minimizing visual token count, often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5$-$10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take the first steps toward designing token compression algorithms tailored for high-compression settings, utilizing prompt-based compression of tokens. Our work underscores the performance and efficiency benefits of operating in low visual token regimes and the importance of developing tailored token reduction algorithms for such conditions. Code is available at https://github.com/locuslab/llava-token-compression.
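The trade-off described in the abstract can be made concrete with a small budget calculation. The sketch below is illustrative only and is not taken from the paper's code. It assumes that LLM inference cost is roughly 2 FLOPs per parameter per processed token, which ignores the vision encoder and the quadratic attention term. The model sizes, the 50-token text prompt, and the function names are hypothetical. For each candidate LLM size it reports the largest number of visual tokens that still fits under a fixed budget.

```python
def inference_flops(llm_params: int, visual_tokens: int, text_tokens: int) -> int:
    """Approximate LLM inference cost (assumption: ~2 FLOPs per parameter per token)."""
    return 2 * llm_params * (visual_tokens + text_tokens)


def max_visual_tokens(budget_flops: int, llm_params: int, text_tokens: int) -> int:
    """Largest visual token count that keeps the model within the budget (0 if none fits)."""
    total_tokens = budget_flops // (2 * llm_params)
    return max(total_tokens - text_tokens, 0)


def budget_frontier(budget_flops: int, llm_sizes: list[int], text_tokens: int) -> dict[int, int]:
    """Map each candidate LLM size to its maximum affordable visual token count."""
    return {n: max_visual_tokens(budget_flops, n, text_tokens) for n in llm_sizes}


# Hypothetical setup: the budget equals the cost of a 1.8B-parameter LLM
# processing 576 visual tokens plus a 50-token text prompt.
TEXT_TOKENS = 50
BUDGET = inference_flops(1_800_000_000, 576, TEXT_TOKENS)
LLM_SIZES = [500_000_000, 1_800_000_000, 4_000_000_000, 7_000_000_000]

frontier = budget_frontier(BUDGET, LLM_SIZES, TEXT_TOKENS)
for params, tokens in frontier.items():
    print(f"{params / 1e9:.1f}B parameters -> at most {tokens} visual tokens")
```

Under this approximation, every point on the frontier costs about the same at inference time. The fitted scaling laws are what decide which point gives the best downstream performance, and the abstract reports that for visual reasoning tasks the optimum lies at the large-LLM, few-token end.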