Neighborhood Stability as a Measure of Nearest Neighbor Searchability
This work addresses a gap in evaluating clustering-based ANNS for high-dimensional datasets, providing tools to determine searchability, which is incremental as it builds on existing ANNS methods without introducing a new search algorithm.
The authors tackled the lack of analytical tools to assess the suitability of clustering-based Approximate Nearest Neighbor Search (ANNS) for datasets, introducing two measures—Clustering-Neighborhood Stability Measure (clustering-NSM) and Point-Neighborhood Stability Measure (point-NSM)—that predict ANNS accuracy and clusterability based on nearest neighbor relationships, applicable to various distance functions.
Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset -- what we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. First is Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not distances, making them applicable to various distance functions including inner product.