Multi-Vector Index Compression in Any Modality
This work addresses the efficiency of multi-vector retrieval over image-, video-, and audio-rich corpora. It is an incremental improvement on late-interaction retrieval that includes a novel method, attention-guided clustering.
The paper tackles the high computation and storage costs of multi-vector late-interaction retrieval across modalities by introducing query-agnostic methods for compressing the document index. Among these, attention-guided clustering outperforms the other parameterized methods and achieves competitive or improved performance compared to an uncompressed index on BEIR, ViDoRe, MSR-VTT, and MultiVENT 2.0.
We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
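To make the idea concrete, the following is a minimal sketch of late-interaction (MaxSim) scoring and of a compression step in the spirit of attention-guided clustering, as described in the abstract. It is not the authors' implementation. The function names, the `saliency` input (one attention-derived weight per token), the choice of the top-`budget` tokens as centroids, and the cosine-similarity assignment are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: each query vector takes its best-matching
    document vector, and the maxima are summed. The cost is proportional to
    len(query_vecs) * len(doc_vecs), so it grows linearly with document length."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def attention_guided_compress(doc_vecs, saliency, budget):
    """Compress doc_vecs to at most `budget` vectors (hypothetical sketch).

    ASSUMPTION: `saliency` holds one non-negative, attention-derived weight
    per token. How such weights are obtained is not stated in the abstract.
    """
    vecs = [normalize(v) for v in doc_vecs]
    k = min(budget, len(vecs))

    # 1. The k most salient tokens serve as cluster centroids.
    centroid_ids = sorted(
        range(len(vecs)), key=lambda i: saliency[i], reverse=True
    )[:k]

    # 2. Assign every token to its most similar centroid (cosine similarity).
    clusters = {c: [] for c in centroid_ids}
    for i, v in enumerate(vecs):
        best = max(centroid_ids, key=lambda c: dot(v, vecs[c]))
        clusters[best].append(i)

    # 3. Aggregate each cluster as a saliency-weighted mean, then re-normalize.
    compressed = []
    for c in centroid_ids:
        members = clusters[c]
        total = sum(saliency[i] for i in members)
        if total > 0:
            dim = len(vecs[c])
            pooled = [
                sum(saliency[i] * vecs[i][d] for i in members) / total
                for d in range(dim)
            ]
        else:
            # Empty cluster or all-zero weights: fall back to the centroid.
            pooled = vecs[c]
        compressed.append(normalize(pooled))
    return compressed


# Toy document: four token vectors compressed to a budget of two.
doc = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
weights = [0.5, 0.1, 0.3, 0.1]
small_index = attention_guided_compress(doc, weights, budget=2)
```

Because the compression depends only on the document and its saliency weights, it can be run once at indexing time, which is what makes it query-agnostic. Scoring a query against `small_index` with `maxsim_score` then costs the same regardless of the original document length.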