DS CG LGOct 29, 2023

Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations

arXiv:2310.19126v113.843 citationsh-index: 5Has Code

Originality Incremental advance

AI Analysis

This provides theoretical guarantees and limitations for practitioners using these popular ANN search tools, revealing potential performance pitfalls in real-world applications.

The paper analyzes the worst-case performance of popular graph-based approximate nearest neighbor search implementations (HNSW, NSG, DiskANN), showing that DiskANN's slow preprocessing version provably achieves constant approximation ratio and poly-logarithmic query time on bounded-dimension data, while other variants can require linear query time (e.g., at least 0.1n steps for DiskANN to find any of the 5 nearest neighbors).

Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search query with constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in instance size. For example, for DiskANN, we show that the query procedure can take at least $0.1 n$ steps on instances of size $n$ before it encounters any of the $5$ nearest neighbors of the query.

View on arXiv PDF Code

Similar