CV | AI | CL | Sep 8, 2025

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

arXiv:2509.06415v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for users of vision-language models in document processing applications, though it appears incremental because it builds on existing token pruning methods.

The paper tackles the high computational demands of vision-language models for document understanding by proposing a lightweight token pruning framework that filters non-informative background regions from document images, achieving substantial computational cost reductions while maintaining comparable accuracy.

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
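
The pipeline described in the abstract (patch-level text/non-text classification, max-pooling refinement, then pruning) can be illustrated with a minimal sketch. This is not the authors' implementation: the classifier is replaced by a precomputed grid of per-patch text probabilities, and the names `max_pool_refine` and `prune_patches`, the 0.5 threshold, and the 3x3 kernel are all assumptions made for illustration. "Index-preserving" is read here as keeping each surviving patch's original flat grid index so it can still be used as a position id; the abstract itself does not spell this out.

```python
import numpy as np


def max_pool_refine(mask: np.ndarray, kernel: int = 3) -> np.ndarray:
    """Stride-1 max pooling over a boolean patch mask.

    On a binary mask this acts like a dilation: a patch is kept if any
    patch in its kernel x kernel neighbourhood was classified as text.
    Assumes an odd kernel size (hypothetical default of 3).
    """
    pad = kernel // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros_like(mask)
    # OR together every shifted view covered by the pooling window.
    for dy in range(kernel):
        for dx in range(kernel):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def prune_patches(text_prob: np.ndarray, threshold: float = 0.5, kernel: int = 3):
    """Return the refined keep-mask and the original flat indices of kept patches.

    text_prob stands in for the output of a binary patch-level classifier
    (assumption: one probability per patch, arranged on the patch grid).
    """
    mask = text_prob >= threshold              # binary text / non-text decision
    refined = max_pool_refine(mask, kernel)    # recover fragmented text regions
    kept_indices = np.flatnonzero(refined)     # row-major indices in the full grid
    return refined, kept_indices


# Toy 4 x 6 patch grid with two isolated "text" patches on the same row.
prob = np.zeros((4, 6))
prob[1, 1] = 0.9
prob[1, 3] = 0.8

refined, kept = prune_patches(prob)

# Stand-in patch embeddings: one 8-dimensional vector per patch.
tokens = np.arange(prob.size * 8, dtype=np.float32).reshape(prob.size, 8)
kept_tokens = tokens[kept]   # tokens that would be passed on to the VLM
position_ids = kept          # original grid indices, not renumbered 0..k-1

print(f"kept {kept.size} of {prob.size} patches")
```

In this toy grid the two isolated detections are merged into one contiguous block, so 15 of the 24 patches survive and the bottom row is dropped entirely. A larger kernel would keep more context around text at the cost of pruning fewer tokens.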

