DB · DC · IR · May 24

PiPNN: Ultra-Scalable Graph-Based Nearest Neighbor Indexing

arXiv:2602.21247 · 40.2 · h-index: 29
Predicted impact: top 38% in DB · last 90 days · Originality: Highly original
AI Analysis

For practitioners needing fast and scalable ANN index construction, PiPNN dramatically reduces build time while maintaining high query throughput.

PiPNN introduces a graph construction algorithm for approximate nearest neighbor search that avoids the search bottleneck of existing methods, achieving up to 11.6x faster index construction than Vamana and up to 12.9x faster than HNSW, enabling billion-scale index construction in under 20 minutes on a single multicore machine.

The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have long construction times because they rely on random-access-heavy beam searches. We introduce PiPNN (Pick-in-Partitions Nearest Neighbors), an ultra-scalable graph construction algorithm that avoids the "search bottleneck" that existing graph-based methods suffer from. PiPNN's core innovation is HashPrune, a novel online pruning algorithm that dynamically maintains sparse collections of edges. HashPrune enables PiPNN to partition the dataset into overlapping sub-problems, efficiently perform bulk distance comparisons via dense matrix multiplication kernels, and stream a subset of the edges into HashPrune. HashPrune guarantees bounded memory during index construction, which permits PiPNN to build higher-quality indexes without extra intermediate memory. PiPNN builds state-of-the-art indexes up to 11.6x faster than Vamana (DiskANN) and up to 12.9x faster than HNSW. PiPNN is also significantly more scalable than recent algorithms for fast graph construction: it builds indexes at least 19.1x faster than MIRAGE and 17.3x faster than FastKCNA while producing indexes that achieve higher query throughput. PiPNN enables us to build, for the first time, high-quality ANN indexes on billion-scale datasets in under 20 minutes using a single multicore machine.
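The pipeline the abstract describes (overlapping partitions, bulk distance computation via matrix products, edges streamed into a bounded per-node store) can be illustrated with a minimal sketch. This is not the paper's implementation. The abstract does not specify HashPrune's pruning rule, so `BoundedEdgeStore` below is a hypothetical stand-in that simply keeps the closest `max_degree` edges per node. All function names, parameters, and the toy data are assumptions made for illustration. The Gram matrix is computed with plain loops to keep the sketch dependency-free; a real system would presumably call an optimized dense matrix multiplication kernel instead.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Edge = Tuple[int, int, float]  # (source id, destination id, squared distance)


def gram_matrix(points: Sequence[Vector]) -> List[List[float]]:
    """Dense product P @ P^T, written as loops (a GEMM kernel in practice)."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def pairwise_sq_dists(points: Sequence[Vector]) -> List[List[float]]:
    """All squared Euclidean distances via |a|^2 + |b|^2 - 2 a.b."""
    gram = gram_matrix(points)
    n = len(points)
    norms = [gram[i][i] for i in range(n)]
    # Clamp at zero to guard against small negative values from rounding.
    return [
        [max(norms[i] + norms[j] - 2.0 * gram[i][j], 0.0) for j in range(n)]
        for i in range(n)
    ]


def partition_candidate_edges(
    ids: Sequence[int], points: Sequence[Vector], k: int
) -> List[Edge]:
    """For each point, emit edges to its k nearest neighbors in the partition."""
    dists = pairwise_sq_dists(points)
    edges: List[Edge] = []
    for i, row in enumerate(dists):
        nearest = heapq.nsmallest(k, ((d, j) for j, d in enumerate(row) if j != i))
        edges.extend((ids[i], ids[j], d) for d, j in nearest)
    return edges


class BoundedEdgeStore:
    """Hypothetical stand-in, NOT HashPrune: keeps the closest edges per node."""

    def __init__(self, max_degree: int) -> None:
        self.max_degree = max_degree
        self.neighbors: Dict[int, Dict[int, float]] = {}

    def add(self, src: int, dst: int, dist: float) -> None:
        nbrs = self.neighbors.setdefault(src, {})
        nbrs[dst] = min(dist, nbrs.get(dst, dist))
        if len(nbrs) > self.max_degree:
            # Evict the farthest neighbor so memory per node stays bounded.
            del nbrs[max(nbrs, key=nbrs.get)]


if __name__ == "__main__":
    # Toy data: five 2-D points split into two overlapping partitions.
    data = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0), 3: (5.0, 5.0), 4: (6.0, 5.0)}
    partitions = [[0, 1, 2, 3], [2, 3, 4]]
    store = BoundedEdgeStore(max_degree=2)
    for part in partitions:
        for src, dst, dist in partition_candidate_edges(part, [data[i] for i in part], k=2):
            store.add(src, dst, dist)
    print(store.neighbors)
```

Because every partition is processed as one dense block, memory access is sequential rather than the random-access pattern of a beam search, which is the contrast the abstract draws. Points that appear in several partitions (ids 2 and 3 in the toy data) receive candidate edges from each, and the store's degree cap keeps the total edge count bounded however many partitions contribute.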

