CV | Jan 17, 2025

HiMix: Reducing Computational Complexity in Large Vision-Language Models

arXiv:2501.10318v1 | 1 citation | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for practical deployment of LVLMs, representing an incremental improvement by optimizing existing architectures.

The paper tackles the high computational complexity of Large Vision-Language Models (LVLMs). It identifies redundant vision sequences as a bottleneck and proposes HiMix, a hierarchical interaction mechanism that reduces the language decoder's computational cost by 10x while maintaining comparable performance.

Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is the involvement of redundant vision sequences in model computation. This argument is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. We then propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language sequence at specific stages within each language decoder layer. Strikingly, our approach significantly reduces computational complexity with minimal performance loss. Specifically, HiMix achieves a 10x reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. This highlights the advantages of our method, and we hope our research brings new perspectives to the field of vision-language understanding. Project Page: https://xuange923.github.io/HiMix
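To make the cost argument in the abstract concrete, the following is a rough back-of-the-envelope sketch, not the paper's own accounting. It compares the multiply-accumulate count of one decoder layer when vision and language tokens are both propagated in full against a layer where only language tokens are propagated and vision tokens contribute keys and values to attention. The function names, the cost formulas (a standard layer with four attention projections and a 4x feed-forward expansion), the assumption that vision keys and values are projected in every layer, and the token counts are all illustrative assumptions that do not come from the source.

```python
# Rough per-layer cost model (multiply-accumulates). All formulas and
# numbers here are illustrative assumptions, not figures from the paper.

def full_sequence_cost(n_vision, n_text, d_model):
    """Every vision and text token passes through the whole layer."""
    n = n_vision + n_text
    projections = 4 * n * d_model ** 2   # Q, K, V and output projections
    attention = 2 * n * n * d_model      # score matrix plus weighted sum
    ffn = 8 * n * d_model ** 2           # two matrices with 4x expansion
    return projections + attention + ffn


def language_only_cost(n_vision, n_text, d_model):
    """Only text tokens are propagated; vision tokens act as extra
    keys/values (assumed to be projected once per layer)."""
    q_and_out = 2 * n_text * d_model ** 2
    k_and_v = 2 * (n_text + n_vision) * d_model ** 2
    attention = 2 * n_text * (n_text + n_vision) * d_model
    ffn = 8 * n_text * d_model ** 2
    return q_and_out + k_and_v + attention + ffn


# Hypothetical setting: many vision tokens, a short text prompt.
n_vision, n_text, d_model = 576, 64, 4096
ratio = (full_sequence_cost(n_vision, n_text, d_model)
         / language_only_cost(n_vision, n_text, d_model))
print(f"estimated per-layer cost ratio: {ratio:.1f}x")
```

Under this simplified model the saving grows with the share of vision tokens in the sequence, which is consistent with the abstract's claim that redundant vision sequences are a main bottleneck. The exact ratio depends on the assumed formulas and should not be read as a reproduction of the reported 10x figure.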

