LGAIDSIRFeb 25, 2023

The Effect of Points Dispersion on the $k$-nn Search in Random Projection Forests

arXiv:2302.13160v12 citationsh-index: 17
Originality Incremental advance
AI Analysis

This work addresses scalability issues in high-dimensional nearest neighbor search for machine learning applications, but it is incremental as it refines existing methods.

The study investigated how point dispersion and the number of trees affect k-nearest neighbor search in random projection forests, finding that with more trees, point dispersion has minimal impact, so using the original algorithm with random directions is recommended.

Partitioning trees are efficient data structures for $k$-nearest neighbor search. Machine learning libraries commonly use a special type of partitioning trees called $k$d-trees to perform $k$-nn search. Unfortunately, $k$d-trees can be ineffective in high dimensions because they need more tree levels to decrease the vector quantization (VQ) error. Random projection trees rpTrees solve this scalability problem by using random directions to split the data. A collection of rpTrees is called rpForest. $k$-nn search in an rpForest is influenced by two factors: 1) the dispersion of points along the random direction and 2) the number of rpTrees in the rpForest. In this study, we investigate how these two factors affect the $k$-nn search with varying $k$ values and different datasets. We found that with larger number of trees, the dispersion of points has a very limited effect on the $k$-nn search. One should use the original rpTree algorithm by picking a random direction regardless of the dispersion of points.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes