A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
For practitioners deploying visual RAG in finance, this study highlights a significant risk: single-vector aggregation may fail to capture fine-grained semantic differences, potentially leading to retrieval errors.
This paper investigates whether aggregating vision patch tokens into a single vector for visual document retrieval loses critical information, particularly in financial documents where small digit changes matter. The authors find that single-vector aggregation collapses distinct documents into nearly identical vectors, obscuring semantic changes, and identify global texture dominance as the root cause.
Visual RAG offers an alternative to traditional RAG: it treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database, so practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents in which a change to a single digit can cause a significant semantic shift. Our experiments show that single-vector aggregation collapses different documents into nearly identical vectors. Patch-level metrics do detect the semantic changes, confirming that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
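The collapse described above can be illustrated with a toy sketch. This is not the paper's benchmark or models. It assumes mean pooling as the aggregation step (the abstract does not name the method) and uses synthetic patch tokens that share one large common component as a stand-in for global texture. All names and sizes (`make_document`, 256 patches, 32 dimensions, the edited patch index) are hypothetical choices for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(patches):
    """Aggregate patch tokens into one vector by averaging (assumed method)."""
    count = len(patches)
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / count for i in range(dim)]


def min_patch_similarity(doc_a, doc_b):
    """Lowest cosine similarity over position-aligned patch pairs."""
    return min(cosine(a, b) for a, b in zip(doc_a, doc_b))


def make_document(rng, num_patches=256, dim=32):
    """Synthetic patch tokens: one shared component plus small per-patch noise.

    The shared component is a stand-in for page-wide texture that every
    patch carries; it is an assumption, not a measured property.
    """
    shared = [rng.gauss(0, 1) for _ in range(dim)]
    return [[s + 0.1 * rng.gauss(0, 1) for s in shared] for _ in range(num_patches)]


rng = random.Random(0)
doc_original = make_document(rng)

# Copy the document and overwrite a single patch with an unrelated vector,
# mimicking a one-digit change that alters only one small image region.
doc_edited = [list(p) for p in doc_original]
doc_edited[17] = [rng.gauss(0, 1) for _ in range(32)]

pooled_sim = cosine(mean_pool(doc_original), mean_pool(doc_edited))
patch_sim = min_patch_similarity(doc_original, doc_edited)

print(f"similarity of pooled vectors: {pooled_sim:.6f}")
print(f"lowest patch-level similarity: {patch_sim:.6f}")
```

In this toy setup the pooled vectors remain almost indistinguishable because one changed patch contributes only 1/256 of the average, while the patch-level comparison exposes the edit directly. Real encoders and the mitigation strategies the paper evaluates may behave differently.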