CODMDSPRMLNov 13, 2017

Randomized Near Neighbor Graphs, Giant Components, and Applications in Data Science

arXiv:1711.04712v116 citations
Originality Incremental advance
AI Analysis

This work addresses the efficiency of constructing affinity matrices in data science by providing a sparser alternative to traditional nearest neighbor graphs, which is incremental but offers practical computational benefits.

The paper tackles the problem of constructing sparse random graphs with giant connected components by proving that connecting each point to a logarithmic number of randomly chosen nearest neighbors suffices, reducing edges from ~n log n to ~n log log n while maintaining connectivity. This result is demonstrated with numerical examples showing it can simplify and accelerate computations in data science applications.

If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to its $k-$nearest neighbors, then it is well known that there exists a giant connected component with high probability. We prove that in $[0,1]^d$ it suffices to connect every point to $ c_{d,1} \log{\log{n}}$ points chosen randomly among its $ c_{d,2} \log{n}-$nearest neighbors to ensure a giant component of size $n - o(n)$ with high probability. This construction yields a much sparser random graph with $\sim n \log\log{n}$ instead of $\sim n \log{n}$ edges that has comparable connectivity properties. This result has nontrivial implications for problems in data science where an affinity matrix is constructed: instead of picking the $k-$nearest neighbors, one can often pick $k' \ll k$ random points out of the $k-$nearest neighbors without sacrificing efficiency. This can massively simplify and accelerate computation, we illustrate this with several numerical examples.

Code Implementations3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes