CV · Nov 13, 2025

GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs

Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang

arXiv:2511.10081v1 · 1 citation · h-index: 2

Originality: Highly original

AI Analysis

GridPrune addresses efficiency issues for users of MLLMs by introducing a pruning strategy inspired by human attention, though the contribution is incremental in that it builds on existing token pruning techniques.

The paper tackles the computational overhead of visual tokens in multimodal large language models by proposing GridPrune, a two-stage pruning method that first allocates a token budget across spatial zones and then selects tokens locally within each zone. On LLaVA-NeXT-7B, it retains 96.98% of full performance with 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.

Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
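The "guide-globally, select-locally" idea described above can be illustrated with a minimal sketch. The abstract does not specify the scoring functions, the zone shape, or the allocation rule, so everything below is an assumption made for illustration: `relevance` stands in for a text-conditional score per visual token, `saliency` stands in for whatever local selection score is used, zones are square tiles of the token grid, and budgets are split in proportion to a softmax over mean zone relevance. The function names `allocate_budget` and `grid_prune_sketch` are hypothetical and are not taken from the paper.

```python
import numpy as np


def allocate_budget(zone_scores, total_budget, zone_capacity):
    """Split total_budget across zones in proportion to softmax(zone_scores).

    Assumed allocation rule (not from the paper): proportional shares,
    capped at the number of tokens a zone holds, with leftovers handed to
    the zones whose current budget is furthest below their ideal share.
    """
    if total_budget > zone_capacity * len(zone_scores):
        raise ValueError("total_budget exceeds the number of available tokens")
    weights = np.exp(zone_scores - zone_scores.max())
    weights /= weights.sum()
    ideal = weights * total_budget
    budget = np.minimum(np.floor(ideal).astype(int), zone_capacity)
    while budget.sum() < total_budget:
        # Only zones that still have room are eligible for an extra token.
        shortfall = np.where(budget < zone_capacity, ideal - budget, -np.inf)
        budget[np.argmax(shortfall)] += 1
    return budget


def grid_prune_sketch(relevance, saliency, grid_hw, zone, total_budget):
    """Return sorted indices of kept tokens for an h x w token grid.

    Stage 1 ("where to look"): budget per zone from mean text relevance.
    Stage 2 ("what to select"): top-k by saliency inside each zone.
    Assumes h and w are both divisible by the zone side length `zone`.
    """
    h, w = grid_hw
    rows, cols = np.divmod(np.arange(h * w), w)
    zone_id = (rows // zone) * (w // zone) + (cols // zone)
    n_zones = (h // zone) * (w // zone)

    zone_scores = np.array(
        [relevance[zone_id == z].mean() for z in range(n_zones)]
    )
    budget = allocate_budget(zone_scores, total_budget, zone * zone)

    keep = []
    for z in range(n_zones):
        idx = np.flatnonzero(zone_id == z)
        order = np.argsort(saliency[idx])[::-1]  # highest saliency first
        keep.extend(idx[order[: budget[z]]].tolist())
    return np.sort(np.array(keep, dtype=int))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    relevance = rng.random(24 * 24)  # placeholder text-conditional scores
    saliency = rng.random(24 * 24)   # placeholder local selection scores
    kept = grid_prune_sketch(relevance, saliency, (24, 24), zone=6, total_budget=64)
    print(len(kept), "of", 24 * 24, "tokens kept")
```

Compared with a single global Top-K over `saliency`, this structure guarantees that the number of tokens kept in any region is decided by the text-conditional zone scores rather than by wherever the highest individual scores happen to cluster, which is the spatial-allocation issue the abstract attributes to existing methods.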
