CVJan 31, 2025

RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs

arXiv:2501.19036v34 citationsh-index: 16Has CodeACL
Originality Incremental advance
AI Analysis

This addresses efficiency bottlenecks for decoder-only MLLM users, though it appears incremental as it builds on existing architectures.

The paper tackles the performance-efficiency tradeoff in Multimodal Large Language Models by revealing substantial redundancy in visual token processing in decoder-only architectures, achieving comparable or better performance than state-of-the-art methods with a training-free inference acceleration approach.

Current Multimodal Large Language Model (MLLM) architectures face a critical tradeoff between performance and efficiency: decoder-only architectures achieve higher performance but lower efficiency, while cross-attention-based architectures offer greater efficiency but lower performance. The key distinction lies in how visual tokens are processed. Decoder-only architectures apply self-attention and FFN operations on visual tokens, while cross-attention architectures skip these computations. To investigate whether redundancy exists in this computationally expensive process, we propose a training-free framework for analyzing trained MLLMs. It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens, as well as a Layer Ranking Algorithm that prioritizes layers for these reductions. Extensive experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs, offering valuable insights for future MLLM architecture design. Furthermore, by leveraging our reduction framework as a training-free inference acceleration approach, we achieve performance comparable to or better than state-of-the-art methods while remaining compatible with them. Code will be publicly available at https://github.com/L-Hugh/RedundancyLens.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes