CV AI CL
Sep 8, 2025

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

arXiv:2509.06415v1
6.21 citations

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues for users of vision-language models in document processing applications, though it appears incremental because it builds on existing token pruning methods.

The paper tackles the high computational demands of vision-language models for document understanding by proposing a lightweight token pruning framework that filters non-informative background regions from document images, achieving substantial computational cost reductions while maintaining comparable accuracy.

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
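The two-stage filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-patch text scores are assumed to come from some binary classifier that is not shown, and the function names, the 0.5 threshold, and the 3x3 stride-1 pooling window are hypothetical choices made for illustration. Returning the original flat patch indices is one plausible reading of "index-preserving" in the title, the idea being that surviving patches keep their original positions.

```python
import numpy as np


def max_pool_same(mask: np.ndarray, kernel: int = 3) -> np.ndarray:
    """Stride-1 max pooling with zero padding; output keeps the grid shape.

    Assumes an odd kernel size (hypothetical default: 3).
    """
    pad = kernel // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=0)
    h, w = mask.shape
    out = np.zeros_like(mask)
    # Take the elementwise maximum over every shifted view of the window.
    for dy in range(kernel):
        for dx in range(kernel):
            out = np.maximum(out, padded[dy:dy + h, dx:dx + w])
    return out


def prune_patches(text_scores: np.ndarray, threshold: float = 0.5, kernel: int = 3):
    """Return the refined keep-mask and the original flat indices of kept patches.

    text_scores: (rows, cols) grid of per-patch text probabilities
    (assumed to be produced by a patch-level classifier, not shown here).
    """
    mask = (text_scores >= threshold).astype(np.uint8)  # binary text / non-text decision
    refined = max_pool_same(mask, kernel)                # reconnect fragmented text regions
    kept = np.flatnonzero(refined)                       # row-major indices in the original grid
    return refined, kept


# Toy 4x6 patch grid: two text patches separated by one background patch.
scores = np.zeros((4, 6))
scores[1, 1] = 0.9
scores[1, 3] = 0.8

refined, kept = prune_patches(scores)
print(refined)
print(kept)
```

In this toy grid the background patch between the two detections (flat index 8) is recovered by the pooling step, and the bottom row and last column are dropped. A downstream model could then select `tokens[kept]` and reuse `kept` as position indices, under the assumption that it accepts a sparse subset of patch positions.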

