CV | Sep 19, 2025

Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

arXiv:2509.15704v2 | 1 citation | h-index: 11 | IEEE Transactions on Circuits and Systems for Video Technology (Print)
Originality: Incremental advance
AI Analysis

This work targets the inference-efficiency bottleneck of high-resolution large vision-language models. It is an incremental advance, since it builds on existing token pruning methods.

The paper tackles the problem of high computational cost in high-resolution large vision-language models by proposing Pyramid Token Pruning, a training-free strategy that reduces token count through saliency and instruction guidance, achieving substantial reductions in cost and latency with minimal performance loss.

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
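
The hierarchy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the saliency or relevance scores are computed, so they are taken here as precomputed inputs. The function names, the proportional budget allocation, and the mixing weight `alpha` are all assumptions made for illustration. Region-level saliency decides how many tokens each sub-image may keep, and within each sub-image the tokens are ranked by a blend of token-level saliency and instruction relevance.

```python
from typing import List


def allocate_budgets(region_scores: List[float],
                     region_sizes: List[int],
                     total_keep: int) -> List[int]:
    """Split a global token budget across regions (sub-images).

    Assumption: budgets are proportional to region saliency and capped
    at each region's token count. The paper may allocate differently.
    """
    n = len(region_scores)
    total = sum(region_scores)
    shares = [1.0 / n] * n if total <= 0 else [s / total for s in region_scores]
    budgets = [min(size, int(total_keep * share))
               for share, size in zip(shares, region_sizes)]

    # Hand out tokens lost to rounding or capping, most salient regions first.
    leftover = min(total_keep, sum(region_sizes)) - sum(budgets)
    order = sorted(range(n), key=lambda i: region_scores[i], reverse=True)
    while leftover > 0:
        for i in order:
            if leftover == 0:
                break
            if budgets[i] < region_sizes[i]:
                budgets[i] += 1
                leftover -= 1
    return budgets


def prune_tokens(region_scores: List[float],
                 token_saliency: List[List[float]],
                 instruction_relevance: List[List[float]],
                 total_keep: int,
                 alpha: float = 0.5) -> List[List[int]]:
    """Return, per region, the sorted indices of the tokens to keep.

    alpha is a hypothetical weight that blends bottom-up token saliency
    with top-down instruction relevance.
    """
    sizes = [len(tokens) for tokens in token_saliency]
    budgets = allocate_budgets(region_scores, sizes, total_keep)
    kept = []
    for r, budget in enumerate(budgets):
        combined = [alpha * s + (1 - alpha) * q
                    for s, q in zip(token_saliency[r], instruction_relevance[r])]
        top = sorted(range(len(combined)),
                     key=lambda t: combined[t], reverse=True)[:budget]
        kept.append(sorted(top))  # preserve the original token order
    return kept


if __name__ == "__main__":
    # Two sub-images with four tokens each; keep four tokens overall.
    kept = prune_tokens(
        region_scores=[3.0, 1.0],
        token_saliency=[[0.9, 0.1, 0.5, 0.2], [0.3, 0.8, 0.1, 0.4]],
        instruction_relevance=[[0.1, 0.9, 0.5, 0.2], [0.2, 0.1, 0.9, 0.3]],
        total_keep=4,
    )
    print(kept)
```

In this toy input the more salient sub-image receives three of the four retained tokens. In the second sub-image the single retained token is the one the instruction scores highest, even though another token there has higher visual saliency. Because the sketch only selects indices and involves no learned parameters, it is consistent with the training-free framing in the abstract.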

