CV · AI · Oct 20, 2025

ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

arXiv:2510.17197v1
Originality: Incremental advance
AI Analysis

This paper addresses the high inference cost of Vision-Language Models. It is an incremental improvement over existing pruning methods, adding text-prompt guidance to the selection of visual tokens.

The paper tackles the high inference cost that visual token redundancy imposes on Vision-Language Models. It proposes a zero-shot, prompt-aware pruning method that matches or surpasses the state of the art with minimal accuracy loss, even when pruning up to 90% of visual tokens, while also reducing GPU memory footprint and inference latency.

As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state of the art with only minimal accuracy loss, even when pruning up to 90% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.
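The two-stage selection described in the abstract (a task-relevant core first, then diversity tokens) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the scoring functions, so the relevance score (maximum cosine similarity to any prompt token), the diversity step (greedy farthest-point selection), and the names `prune_visual_tokens` and `relevance_frac` are all assumptions made for this example.

```python
import numpy as np


def prune_visual_tokens(visual, prompt, keep, relevance_frac=0.5):
    """Return sorted indices of the visual tokens to retain.

    visual: (N, d) array of visual token embeddings.
    prompt: (M, d) array of prompt token embeddings.
    keep: number of visual tokens to retain (clipped to N).
    relevance_frac: hypothetical knob, share of `keep` given to stage 1.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    keep = min(keep, len(visual))

    # L2-normalise so that dot products are cosine similarities.
    eps = 1e-12
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + eps)
    p = prompt / (np.linalg.norm(prompt, axis=1, keepdims=True) + eps)

    # Stage 1 (assumed scoring): a token's relevance is its highest
    # cosine similarity to any prompt token; keep the top scorers.
    relevance = (v @ p.T).max(axis=1)
    n_core = min(keep, max(1, int(round(keep * relevance_frac))))
    selected = [int(i) for i in np.argsort(-relevance)[:n_core]]

    # Stage 2 (assumed scoring): greedily add the token that is least
    # similar to everything selected so far (farthest-point selection).
    max_sim = (v @ v[selected].T).max(axis=1)
    max_sim[selected] = np.inf  # never pick a token twice
    while len(selected) < keep:
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, v @ v[nxt])
        max_sim[nxt] = np.inf

    return np.sort(np.array(selected))


# Usage: keep 10% of 576 synthetic visual tokens (i.e. prune 90%).
rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(576, 64))
prompt_tokens = rng.normal(size=(8, 64))
kept = prune_visual_tokens(visual_tokens, prompt_tokens, keep=58)
print(kept.shape)
```

In this sketch `relevance_frac` sets the balance between task relevance and diversity: at 1.0 the selection is purely prompt-driven, and lower values leave more of the budget for tokens that cover the rest of the image.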

