CV | May 21, 2025

Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM

arXiv:2505.15816v1 | 1 citation | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses computational bottlenecks for users of large multimodal models and offers an incremental improvement over existing token reduction methods.

The paper tackles the computational inefficiency of large multimodal models by identifying and reducing computation-level redundancy in visual token processing. It proposes ProxyV, which achieves efficiency gains without performance loss and sometimes improves performance.

Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.
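
To make the abstract's point about "excessive computation on visual tokens" concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper or its repository. The FLOP formula is a standard approximation (causal masking ignored, a plain two-matrix FFN assumed), and every number in it (token counts, hidden sizes, the proxy token count) is a hypothetical value chosen for illustration. The last function assumes that a smaller set of proxy tokens stands in for the full vision tokens in the heavy operations and ignores the cost of any lightweight update to the original vision tokens; the actual ProxyV design may differ.

```python
def layer_flops(n_tokens: int, d_model: int, d_ff: int) -> int:
    """Rough forward FLOPs for one decoder layer (one multiply-add = 2 FLOPs).

    Assumptions (not from the paper): no causal-mask savings, a two-matrix FFN.
    """
    qkvo = 4 * 2 * n_tokens * d_model * d_model   # Q, K, V and output projections
    attn = 2 * 2 * n_tokens * n_tokens * d_model  # QK^T scores plus weighted sum over V
    ffn = 2 * 2 * n_tokens * d_model * d_ff       # up and down projections
    return qkvo + attn + ffn


def vision_share(n_vis: int, n_txt: int, d_model: int, d_ff: int) -> float:
    """Fraction of per-layer FLOPs that disappears if vision tokens are dropped."""
    full = layer_flops(n_vis + n_txt, d_model, d_ff)
    text_only = layer_flops(n_txt, d_model, d_ff)
    return 1.0 - text_only / full


def proxy_saving(n_vis: int, n_proxy: int, n_txt: int, d_model: int, d_ff: int) -> float:
    """Fraction of per-layer FLOPs saved if only n_proxy tokens (hypothetical)
    go through the heavy operations in place of the n_vis vision tokens."""
    full = layer_flops(n_vis + n_txt, d_model, d_ff)
    reduced = layer_flops(n_proxy + n_txt, d_model, d_ff)
    return 1.0 - reduced / full


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    n_vis, n_txt, d_model, d_ff = 2880, 128, 4096, 11008
    print(f"vision share of layer FLOPs: {vision_share(n_vis, n_txt, d_model, d_ff):.1%}")
    print(f"saving with 180 proxy tokens: {proxy_saving(n_vis, 180, n_txt, d_model, d_ff):.1%}")
```

Under these assumed sizes, vision tokens account for most of the per-layer cost, which is why lightening their processing rather than only pruning them is a plausible target for efficiency work.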

Code Implementations: 1 repo
