CV | May 27, 2025

CROP: Contextual Region-Oriented Visual Token Pruning

arXiv:2505.21233v2 | 10 citations | h-index: 10 | EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses the computational inefficiency of vision-language models (VLMs) on visual question answering (VQA) tasks, but the contribution is incremental because it builds on existing pruning methods.

The paper tackles the problem of excessive visual tokens in VLM-based VQA methods by proposing CROP, a framework that compresses tokens through localization and pruning, achieving state-of-the-art performance on various VQA tasks.

Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance.
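
The localize-then-prune idea in the abstract can be illustrated with a toy sketch. It assumes the visual tokens lie on a `grid_h` x `grid_w` patch grid and that the localization step has already produced a bounding box in patch coordinates. Every token inside the box is kept, while tokens outside are subsampled on a coarser grid, giving the two regions different compression ratios. The function names, the box format, and the fixed-stride subsampling are all hypothetical choices made for illustration; the abstract does not specify how PLC or ILP are implemented, and this is not the authors' method.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical box format: (row_min, col_min, row_max, col_max),
# inclusive, measured in patch units rather than pixels.
Box = Tuple[int, int, int, int]


def select_token_indices(
    grid_h: int, grid_w: int, region: Box, outside_stride: int = 2
) -> List[int]:
    """Return flat (row-major) indices of the visual tokens to keep.

    Tokens inside `region` are all kept. Tokens outside it are kept only
    when both their row and column are multiples of `outside_stride`,
    which is a simple stand-in for a higher compression ratio there.
    """
    if outside_stride < 1:
        raise ValueError("outside_stride must be >= 1")
    r0, c0, r1, c1 = region
    kept = []
    for r in range(grid_h):
        for c in range(grid_w):
            inside = r0 <= r <= r1 and c0 <= c <= c1
            on_coarse_grid = r % outside_stride == 0 and c % outside_stride == 0
            if inside or on_coarse_grid:
                kept.append(r * grid_w + c)
    return kept


def prune_tokens(
    tokens: Sequence[T],
    grid_h: int,
    grid_w: int,
    region: Box,
    outside_stride: int = 2,
) -> List[T]:
    """Apply the index selection to a flat, row-major token sequence."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    keep = select_token_indices(grid_h, grid_w, region, outside_stride)
    return [tokens[i] for i in keep]


if __name__ == "__main__":
    # 4x4 grid with the query-relevant region covering the centre 2x2 patches.
    kept = select_token_indices(4, 4, region=(1, 1, 2, 2), outside_stride=2)
    print(f"kept {len(kept)} of 16 tokens: {kept}")
```

In this toy setting the centre region keeps all 4 of its tokens while the remaining 12 positions contribute only 3, so 7 of 16 tokens survive. A real system would obtain the region from a localization model and would likely merge or pool the outside tokens rather than drop them.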

