IR | AI | CL | Apr 17, 2025

Towards Lossless Token Pruning in Late-Interaction Retrieval Models

arXiv:2504.12778v1 | 4 citations | h-index: 2 | Has Code | SIGIR
Originality: Incremental advance
AI Analysis

The paper addresses memory efficiency for IR systems and offers a practical improvement for deployment, though the advance is incremental because it builds on existing models.

The paper tackles the high memory requirement of late-interaction retrieval models like ColBERT by proposing a principled token pruning method that preserves ColBERT's retrieval performance while keeping only 30% of the tokens.

Late-interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representations of all document tokens. Some works have proposed using either heuristics or statistical techniques to prune tokens from each document. This, however, does not guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in-domain and out-of-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
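To make "pruning without impacting the score" concrete, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring. It is an illustration under stated assumptions, not the paper's method: the toy 2-D embeddings and the helper names `maxsim_score` and `used_token_indices` are hypothetical. The example also only checks a fixed sample of queries, whereas the paper aims for pruning that leaves the score unchanged in general.

```python
from typing import List, Sequence, Set

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query: List[Vector], doc: List[Vector]) -> float:
    """Late-interaction score: each query token picks its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def used_token_indices(queries: List[List[Vector]], doc: List[Vector]) -> Set[int]:
    """Indices of document tokens that are the arg-max for at least one
    query token in the given sample (hypothetical helper)."""
    used: Set[int] = set()
    for query in queries:
        for q in query:
            sims = [dot(q, d) for d in doc]
            used.add(sims.index(max(sims)))
    return used


# Toy data (assumed for illustration): four document tokens, two queries.
doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.1]]
queries = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.6, 0.8]],
]

# Keep only the tokens that ever win a max; the others never contribute.
keep = sorted(used_token_indices(queries, doc))
pruned_doc = [doc[i] for i in keep]

for query in queries:
    # Scores are identical before and after pruning for these queries.
    assert maxsim_score(query, doc) == maxsim_score(query, pruned_doc)
```

In this toy case only tokens 0 and 1 are kept, and both sample queries score the same against the pruned document. A token that never attains the maximum for any query can be dropped with no effect on the score. Guaranteeing that property beyond a sample of queries is the harder problem the abstract describes.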

Code Implementations: 1 repo