CV · Apr 3

MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

arXiv:2604.03072 · 48.91 citations
AI Analysis

This work addresses efficiency issues in MLLMs for applications requiring fast inference, though it is incremental as it builds on existing pruning techniques.

The paper tackles inefficient inference in multimodal large language models (MLLMs) by pruning visual tokens, and reports that its method outperforms previous attention-based pruning approaches with minimal latency.

For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual token pruning has emerged to make inference more efficient. Current approaches typically measure token importance using the attention scores in the visual encoder or in the LLM decoder, then keep the visual tokens with high attention scores and prune the rest. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between the visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature level. Our MI-Pruner is simple, efficient and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.
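The abstract does not say which MI estimator MI-Pruner uses, so the following is only a minimal sketch of the general idea: score each visual token by a mutual-information proxy against the text tokens, computed on the features alone, then keep the top-scoring tokens. It assumes a Gaussian approximation, where the MI of two jointly Gaussian scalars with correlation ρ is -½·log(1 - ρ²), and treats the feature dimensions of each token as samples. The function names `gaussian_mi_scores` and `prune_visual_tokens` and the `keep` parameter are hypothetical and not taken from the paper.

```python
import numpy as np


def gaussian_mi_scores(visual, text, eps=1e-6):
    """Score visual tokens by a Gaussian MI proxy against text tokens.

    visual: (Nv, d) array of visual token features.
    text:   (Nt, d) array of text token features.
    Returns an (Nv,) array; higher means stronger crossmodal dependency.
    ASSUMPTION: Gaussian closed form, not the paper's estimator.
    """
    # Standardize each token across its feature dimensions.
    v = (visual - visual.mean(axis=1, keepdims=True)) / (
        visual.std(axis=1, keepdims=True) + eps
    )
    t = (text - text.mean(axis=1, keepdims=True)) / (
        text.std(axis=1, keepdims=True) + eps
    )
    # Pearson correlation between every visual/text token pair: (Nv, Nt).
    rho = v @ t.T / visual.shape[1]
    rho = np.clip(rho, -1.0 + eps, 1.0 - eps)
    # MI of two jointly Gaussian scalars: -0.5 * log(1 - rho^2).
    mi = -0.5 * np.log1p(-(rho ** 2))
    # Average over text tokens to get one score per visual token.
    return mi.mean(axis=1)


def prune_visual_tokens(visual, text, keep):
    """Return indices of the `keep` highest-scoring visual tokens (keep >= 1),
    sorted so the surviving tokens stay in their original order."""
    scores = gaussian_mi_scores(visual, text)
    top = np.argsort(scores)[-keep:]
    return np.sort(top)
```

Because the scores depend only on the two feature matrices, a sketch like this needs no attention maps and no changes to the model, which is the non-intrusive property the abstract describes. The retained indices would then be used to select rows of the visual feature matrix before they are passed to the LLM decoder.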

