Look in the Middle: Structural Anchor Pruning for Scalable Visual RAG Indexing

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao

arXiv:2601.20107v11.5

Originality: Incremental advance

AI Analysis

The method offers a scalable indexing approach for Visual RAG systems, addressing the index-size bottleneck in visual document retrieval.

The paper tackles the large index-vector storage overhead of Vision-Language Models in Visual Document Retrieval by proposing Structural Anchor Pruning (SAP), a training-free method that reduces index vectors by over 90% while maintaining robust retrieval fidelity on the ViDoRe benchmark.

Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index-vector storage overhead. Training-free pruning solutions (e.g., EOS-attention-based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression scenarios (> 80%). Prior research (e.g., Light-ColPali) attributes this to visual token importance being inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free pruning method that identifies key visual patches from middle layers to achieve high-performance compression. We also introduce the Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, unlike traditional pruning solutions that focus on the final layer, where structural signals dissipate.
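
To make the index-pruning setting concrete, the following is a minimal sketch of pruning a page's patch vectors and then scoring a query with ColBERT-style late interaction (MaxSim), the kind of scoring that ColPali-like retrievers use. It is not the authors' implementation: the abstract does not give the SAP scoring rule, so the `importance` values here are random placeholders standing in for whatever query-independent score a pruning method assigns to each patch. The function names `maxsim_score` and `prune_top_k` and all sizes are hypothetical.

```python
import random


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take its best
    dot-product match among the document vectors, then sum over queries."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return total


def prune_top_k(doc_vecs, importance, keep_ratio):
    """Keep the highest-importance patch vectors, preserving page order.

    `importance` is a per-patch, query-independent score (an assumption;
    the source does not specify how SAP computes it).
    """
    if len(doc_vecs) != len(importance):
        raise ValueError("one importance score is required per patch vector")
    k = max(1, round(len(doc_vecs) * keep_ratio))
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: importance[i], reverse=True)[:k]
    return [doc_vecs[i] for i in sorted(ranked)]


# Toy data: 100 patch vectors per page, 4 query token vectors, dimension 8.
rng = random.Random(0)
dim, n_patches, n_query = 8, 100, 4
doc = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
query = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_query)]

# Placeholder importance scores (random here, purely for illustration).
importance = [rng.random() for _ in range(n_patches)]

pruned = prune_top_k(doc, importance, keep_ratio=0.10)  # 90% reduction
full_score = maxsim_score(query, doc)
pruned_score = maxsim_score(query, pruned)
```

Because the pruned set is a subset of the original patches, each per-query maximum can only stay the same or drop, so `pruned_score` never exceeds `full_score`. How much of the full score survives depends on whether the retained patches are the ones queries actually match, which is the property a pruning method has to get right at high compression.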
