CV · Mar 25

ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

arXiv:2603.24680 · 78.2 · h-index: 3 · Has Code
AI Analysis

ReDiPrune addresses the inference cost of multimodal LLMs with a plug-and-play pruning step that improves the accuracy-efficiency trade-off without retraining. The contribution is incremental, since it builds on existing pruning methods.

The paper tackles the computational cost of multimodal large language models with ReDiPrune, a training-free token pruning method applied before the vision-language projector. It improves the accuracy-efficiency trade-off: on EgoSchema with LLaVA-NeXT-Video-7B, retaining 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs.

Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
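
The scoring rule described in the abstract (text-conditioned relevance combined with max-min diversity) can be illustrated with a small greedy selection sketch. This is not the authors' implementation; their code is at the repository linked above. The function name `select_tokens`, the weight `alpha`, the additive way relevance and diversity are combined, the use of cosine similarity, and the assumption that the text query embedding lives in the same feature space as the vision encoder outputs are all hypothetical choices made for illustration.

```python
import numpy as np


def select_tokens(visual, query, keep_ratio=0.15, alpha=0.5):
    """Greedy relevance + max-min diversity selection (illustrative sketch).

    visual:     (N, D) vision-encoder token features, taken before the projector.
    query:      (D,) text query embedding, ASSUMED to share the visual feature space.
    keep_ratio: fraction of visual tokens to keep.
    alpha:      HYPOTHETICAL weight trading relevance against diversity.
    Returns the sorted indices of the kept tokens.
    """
    n = visual.shape[0]
    k = max(1, int(round(keep_ratio * n)))

    # L2-normalise so that dot products are cosine similarities.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)

    # Text-conditioned relevance of every token to the query.
    relevance = v @ q

    # Seed the selection with the most query-relevant token.
    first = int(np.argmax(relevance))
    selected = [first]

    # min_dist[i]: cosine distance from token i to its nearest selected token.
    min_dist = 1.0 - v @ v[first]

    while len(selected) < k:
        # Max-min diversity: prefer tokens far from everything already chosen.
        score = alpha * relevance + (1.0 - alpha) * min_dist
        score[selected] = -np.inf  # never pick the same token twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - v @ v[nxt])

    return np.sort(np.array(selected))
```

In a pipeline of the kind the abstract describes, the returned indices would be used to slice the encoder output (for example `visual[idx]`) before it is passed to the projector, so the projector and the language model only see the retained tokens.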

