Lucene for Approximate Nearest-Neighbors Search on Arbitrary Dense Vectors
This work addresses the challenge of enabling efficient similarity search on dense vectors in widely-used search libraries, which is an incremental improvement for developers and researchers in information retrieval and NLP.
The paper tackled the problem of adapting the Lucene search library for approximate nearest-neighbor search on arbitrary dense vectors, such as word embeddings, by evaluating three integration techniques and finding that the 'fake words' approach offered the best balance between effectiveness and efficiency.
We demonstrate three approaches for adapting the open-source Lucene search library to perform approximate nearest-neighbor search on arbitrary dense vectors, using similarity search on word embeddings as a case study. At its core, Lucene is built around inverted indexes of a document collection's (sparse) term-document matrix, which is incompatible with the lower-dimensional dense vectors that are common in deep learning applications. We evaluate three techniques to overcome these challenges that can all be natively integrated into Lucene: the creation of documents populated with fake words, LSH applied to lexical realizations of dense vectors, and k-d trees coupled with dimensionality reduction. Experiments show that the "fake words" approach represents the best balance between effectiveness and efficiency. These techniques are integrated into the Anserini open-source toolkit and made available to the community.