A Theory-Based Evaluation of Nearest Neighbor Models Put Into Practice
This addresses the need for efficient verification of k-NN graph accuracy in data analysis applications, though it is incremental as it builds on existing property testing and k-NN theory.
The paper tackles the problem of verifying the accuracy of k-nearest neighbor (k-NN) graphs by developing a property testing algorithm that determines if a graph is a k-NN graph or far from it, with theoretical complexity O(√n k²/ε²) and empirical evaluation showing it detects inaccurate models faster than building them.
In the $k$-nearest neighborhood model ($k$-NN), we are given a set of points $P$, and we shall answer queries $q$ by returning the $k$ nearest neighbors of $q$ in $P$ according to some metric. This concept is crucial in many areas of data analysis and data processing, e.g., computer vision, document retrieval and machine learning. Many $k$-NN algorithms have been published and implemented, but often the relation between parameters and accuracy of the computed $k$-NN is not explicit. We study property testing of $k$-NN graphs in theory and evaluate it empirically: given a point set $P \subset \mathbb{R}^δ$ and a directed graph $G=(P,E)$, is $G$ a $k$-NN graph, i.e., every point $p \in P$ has outgoing edges to its $k$ nearest neighbors, or is it $ε$-far from being a $k$-NN graph? Here, $ε$-far means that one has to change more than an $ε$-fraction of the edges in order to make $G$ a $k$-NN graph. We develop a randomized algorithm with one-sided error that decides this question, i.e., a property tester for the $k$-NN property, with complexity $O(\sqrt{n} k^2 / ε^2)$ measured in terms of the number of vertices and edges it inspects, and we prove a lower bound of $Ω(\sqrt{n / εk})$. We evaluate our tester empirically on the $k$-NN models computed by various algorithms and show that it can be used to detect $k$-NN models with bad accuracy in significantly less time than the building time of the $k$-NN model.