MLLGSTSep 3, 2019

Rates of Convergence for Large-scale Nearest Neighbor Classification

arXiv:1909.01464v26 citations
AI Analysis

This work addresses the challenge of distributed nearest neighbor classification for large-scale data, but it is incremental as it builds on existing divide-and-conquer schemes with theoretical analysis and acceleration.

The paper tackles the problem of scaling nearest neighbor classification to large datasets that cannot fit in a single machine's memory by proposing a divide-and-conquer method called bigNN, which achieves the same optimal convergence rates as an oracle classifier in terms of excess risk and classification instability. It also introduces a pre-training acceleration technique to reduce prediction time while maintaining convergence rates, with numerical studies verifying the theoretical results.

Nearest neighbor is a popular class of classification methods with many desirable properties. For a large data set which cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider the divide and conquer scheme: the entire data set is divided into small subsamples, on which nearest neighbor predictions are made, and then a final decision is reached by aggregating the predictions on subsamples by majority voting. We name this method the big Nearest Neighbor (bigNN) classifier, and provide its rates of convergence under minimal assumptions, in terms of both the excess risk and the classification instability, which are proven to be the same rates as the oracle nearest neighbor classifier and cannot be improved. To significantly reduce the prediction time that is required for achieving the optimal rate, we also consider the pre-training acceleration technique applied to the bigNN method, with proven convergence rate. We find that in the distributed setting, the optimal choice of the neighbor $k$ should scale with both the total sample size and the number of partitions, and there is a theoretical upper limit for the latter. Numerical studies have verified the theoretical findings.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes