IR · Aug 25, 2021

On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval

arXiv:2108.11480v1 · 21 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses efficiency issues for dense retrieval systems, but it is incremental as it builds on existing methods like ColBERT.

The paper tackles the efficiency problem in multi-stage dense retrieval by using approximate nearest neighbour scores to rank candidate documents, reducing the candidate set to 200 documents without significant loss in effectiveness and achieving a 2x speedup.

Dense retrieval, which describes the use of contextualised language models such as BERT to identify documents from a collection by leveraging approximate nearest neighbour (ANN) techniques, has been increasing in popularity. Two families of approaches have emerged, depending on whether documents and queries are represented by single or multiple embeddings. ColBERT, the exemplar of the latter, uses an ANN index and approximate scores to identify a set of candidate documents for each query embedding, which are then re-ranked using accurate document representations. In this manner, a large number of documents can be retrieved for each query, hindering the efficiency of the approach. In this work, we investigate the use of ANN scores for ranking the candidate documents, in order to decrease the number of candidate documents being fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, by using the approximate scores to cut the candidate set off at only 200 documents, we can still obtain an effective ranking without statistically significant differences in effectiveness, while achieving a 2x speedup.
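The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the brute-force lookup over coarsely rounded embeddings standing in for a real ANN index, and the way approximate scores are aggregated per document (sum over query embeddings of the best retrieved token score) are all assumptions made for the example.

```python
import numpy as np

def approx_candidate_scores(q_emb, approx_tok_emb, tok_doc_ids, k_prime):
    """Stage 1 (assumed aggregation): approximate per-document scores.

    For each query embedding, take the k_prime most similar token
    embeddings under the approximate representation, then credit each
    document with its best retrieved token score for that query embedding.
    """
    sims = q_emb @ approx_tok_emb.T  # shape: (n_query_embeddings, n_tokens)
    doc_scores = {}
    for qi in range(sims.shape[0]):
        top = np.argpartition(-sims[qi], k_prime - 1)[:k_prime]
        best = {}
        for t in top:
            d = int(tok_doc_ids[t])
            best[d] = max(best.get(d, -np.inf), float(sims[qi, t]))
        for d, s in best.items():
            doc_scores[d] = doc_scores.get(d, 0.0) + s
    return doc_scores

def exact_maxsim(q_emb, tok_emb, tok_doc_ids, doc_id):
    """Stage 2: ColBERT-style MaxSim using the accurate embeddings."""
    doc_tokens = tok_emb[tok_doc_ids == doc_id]
    return float((q_emb @ doc_tokens.T).max(axis=1).sum())

def two_stage_rank(q_emb, tok_emb, approx_tok_emb, tok_doc_ids,
                   k_prime=50, cutoff=200):
    """Rank candidates by approximate score, keep `cutoff`, re-rank exactly."""
    approx = approx_candidate_scores(q_emb, approx_tok_emb, tok_doc_ids, k_prime)
    kept = sorted(approx, key=approx.get, reverse=True)[:cutoff]
    exact = {d: exact_maxsim(q_emb, tok_emb, tok_doc_ids, d) for d in kept}
    ranking = sorted(exact, key=exact.get, reverse=True)
    return ranking, len(approx)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tok_doc_ids = np.repeat(np.arange(50), 8)  # 50 toy documents, 8 tokens each
    tok_emb = rng.normal(size=(400, 16))
    tok_emb /= np.linalg.norm(tok_emb, axis=1, keepdims=True)
    approx_tok_emb = np.round(tok_emb * 4) / 4  # crude stand-in for ANN compression
    q_emb = rng.normal(size=(4, 16))
    q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)
    ranking, n_candidates = two_stage_rank(
        q_emb, tok_emb, approx_tok_emb, tok_doc_ids, k_prime=50, cutoff=10)
    print(f"{n_candidates} candidates retrieved, {len(ranking)} fully scored")
```

The efficiency gain comes from `cutoff`: only the documents that survive the approximate ranking incur the cost of exact scoring. The paper reports a cutoff of 200 on MSMARCO; the toy sizes above are chosen only so the sketch runs quickly.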

Code Implementations: 1 repo