Large Vision-Language Models Get Lost in Attention
For researchers and engineers optimizing LVLM architectures, this work exposes fundamental inefficiencies in attention mechanisms, suggesting that current models fail to effectively leverage visual context.
The paper proposes a unified information-theoretic and geometric framework to analyze residual updates in large vision-language models (LVLMs), revealing that attention acts as a subspace-preserving operator while FFNs drive semantic innovation. It further shows that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance on a majority of datasets, indicating severe redundancy in current attention mechanisms.
Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Deciphering the distinct roles of its internal modules is therefore critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a framework grounded in information theory and geometry that quantifies the geometric and entropic nature of residual updates. Applying this framework reveals a fundamental functional decoupling: attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively "get lost in attention" rather than efficiently leveraging visual context.
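The two claims above can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the paper's method or metric: the single-head setup with identity projections, the helper names (`learned_attention_weights`, `gaussian_attention_weights`, `out_of_span_fraction`), and the choice of softmax-normalized Gaussian scores as the "predefined values" are all hypothetical. The sketch relies only on a standard fact: each attention output row is a convex combination of the value vectors, so it stays inside their span, while a nonlinear FFN update generally does not.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exact zeros.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(n):
    # True where query i may attend to key j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))


def learned_attention_weights(q, k):
    # Standard scaled dot-product attention weights with a causal mask.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(causal_mask(n), scores, -np.inf)
    return softmax(scores)


def gaussian_attention_weights(n, rng):
    # ASSUMPTION: "predefined values" modeled as input-independent
    # Gaussian scores, masked and softmax-normalized like real attention.
    scores = rng.standard_normal((n, n))
    scores = np.where(causal_mask(n), scores, -np.inf)
    return softmax(scores)


def out_of_span_fraction(update, basis_rows):
    # Fraction of the update's norm lying outside the row space of
    # basis_rows (0 means the update stays entirely inside that subspace).
    coeffs, *_ = np.linalg.lstsq(basis_rows.T, update.T, rcond=None)
    recon = (basis_rows.T @ coeffs).T
    return np.linalg.norm(update - recon) / np.linalg.norm(update)


rng = np.random.default_rng(0)
n, d = 4, 16                       # 4 tokens, 16-dim hidden states
x = rng.standard_normal((n, d))
q, k, v = x, x, x                  # toy setup: identity projections

a_learned = learned_attention_weights(q, k)
a_noise = gaussian_attention_weights(n, rng)

# Attention-style updates: mixtures of existing value vectors.
attn_update_learned = a_learned @ v
attn_update_noise = a_noise @ v

# FFN-style update: random two-layer MLP with ReLU (hypothetical weights).
w1 = rng.standard_normal((d, 4 * d))
w2 = rng.standard_normal((4 * d, d))
ffn_update = np.maximum(x @ w1, 0.0) @ w2

print("attention (learned):", out_of_span_fraction(attn_update_learned, v))
print("attention (noise):  ", out_of_span_fraction(attn_update_noise, v))
print("FFN:                ", out_of_span_fraction(ffn_update, v))
```

In this toy, both attention variants stay inside the span of the value vectors regardless of how the weights were produced, whereas the FFN update carries a substantial out-of-span component. Whether the paper's geometric and entropic measures are defined this way is not stated in the abstract, so the sketch should be read only as intuition for the subspace-preserving versus subspace-expanding distinction.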