An Online Hierarchical Algorithm for Extreme Clustering
This addresses the bottleneck in clustering methods that scale well to many data items but not to many clusters, offering a solution for applications requiring efficient handling of massive datasets with numerous clusters.
The paper tackles the problem of scaling clustering algorithms to both large numbers of data points (N) and clusters (K), termed extreme clustering, by introducing PERCH, an online hierarchical algorithm that constructs more accurate trees and achieves higher quality clustering than competitors in nearly half the time.
Many modern clustering methods scale well to a large number of data items, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K--a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time.