IR · CL · May 19, 2022

PLAID: An Efficient Engine for Late Interaction Retrieval

arXiv:2205.09707v1 · 155 citations · h-index: 76
Originality: Incremental advance
AI Analysis

This work addresses the search-latency bottleneck that researchers and practitioners face when using late interaction retrieval. It is an incremental advance focused on speed optimization.

The paper tackles the high search latency of late interaction retrieval models like ColBERTv2 by introducing PLAID, an optimized engine that reduces latency by up to 7x on GPU and 45x on CPU while maintaining state-of-the-art retrieval quality across large-scale benchmarks.

Pre-trained language models are increasingly important components across multiple information retrieval (IR) paradigms. Late interaction, introduced with the ColBERT model and recently refined in ColBERTv2, is a popular paradigm that holds state-of-the-art status across many benchmarks. To dramatically reduce the search latency of late interaction, we introduce the Performance-optimized Late Interaction Driver (PLAID). Without impacting quality, PLAID swiftly eliminates low-scoring passages using a novel centroid interaction mechanism that treats every passage as a lightweight bag of centroids. PLAID uses centroid interaction as well as centroid pruning, a mechanism for sparsifying the bag of centroids, within a highly optimized engine to reduce late interaction search latency by up to 7× on a GPU and 45× on a CPU against vanilla ColBERTv2, while continuing to deliver state-of-the-art retrieval quality. This allows the PLAID engine with ColBERTv2 to achieve latency of tens of milliseconds on a GPU and tens or just a few hundred milliseconds on a CPU, even at the largest scale we evaluate, 140M passages.
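To make the "bag of centroids" idea concrete, here is a minimal NumPy sketch of how centroid interaction and centroid pruning might work. It is an illustration under stated assumptions, not the PLAID implementation. The function name `centroid_interaction`, the `prune_threshold` parameter, and the choice to score an empty bag as negative infinity are all hypothetical. The sketch assumes each passage token has already been assigned to its nearest centroid, and that a passage is scored by summing, over query tokens, the best query-centroid similarity found in its bag.

```python
import numpy as np


def centroid_interaction(query_emb, centroids, passage_centroid_ids, prune_threshold=None):
    """Approximate late-interaction scores using only centroid IDs (illustrative sketch).

    query_emb: (num_query_tokens, dim) array of query token embeddings.
    centroids: (num_centroids, dim) array of cluster centroids.
    passage_centroid_ids: list of 1-D int arrays, one bag of centroid IDs per passage.
    prune_threshold: hypothetical cutoff; a centroid is dropped from a bag when its
        best similarity to any query token falls below this value.
    """
    # Similarity of every centroid to every query token: (num_centroids, num_query_tokens).
    centroid_scores = centroids @ query_emb.T

    scores = []
    for ids in passage_centroid_ids:
        ids = np.unique(ids)  # a bag only needs each centroid once
        if prune_threshold is not None:
            # Centroid pruning: keep only centroids that match some query token well.
            keep = centroid_scores[ids].max(axis=1) >= prune_threshold
            ids = ids[keep]
        if ids.size == 0:
            # Assumption: a fully pruned passage ranks last.
            scores.append(float("-inf"))
            continue
        # Best centroid per query token, summed over query tokens.
        scores.append(float(centroid_scores[ids].max(axis=0).sum()))
    return np.array(scores)


# Toy data: 2 query tokens, 3 centroids, 3 passages.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
bags = [np.array([0, 1]), np.array([2]), np.array([0, 2, 2])]

approx = centroid_interaction(query, centroids, bags)  # scores: 2.0, -1.0, 1.0
pruned = centroid_interaction(query, centroids, bags, prune_threshold=0.5)
shortlist = np.argsort(-pruned)[:2]  # passages kept for full late-interaction scoring
```

In this toy run, pruning removes the poorly matching centroid from the second and third bags, so the second passage drops out entirely while the first and third keep their scores. Only the shortlisted passages would then need full token-level scoring.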

Code Implementations: 2 repos
