LGMLApr 6, 2017

An Online Hierarchical Algorithm for Extreme Clustering

arXiv:1704.01858v113 citations
Originality Highly original
AI Analysis

This addresses the bottleneck in clustering methods that scale well to many data items but not to many clusters, offering a solution for applications requiring efficient handling of massive datasets with numerous clusters.

The paper tackles the problem of scaling clustering algorithms to both large numbers of data points (N) and clusters (K), termed extreme clustering, by introducing PERCH, an online hierarchical algorithm that constructs more accurate trees and achieves higher quality clustering than competitors in nearly half the time.

Many modern clustering methods scale well to a large number of data items, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K--a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes