LGOct 22, 2020

Scalable Hierarchical Agglomerative Clustering

arXiv:2010.11821v360 citations
Originality Highly original
AI Analysis

This addresses the problem of over-merging and poor scalability in clustering for large-scale data applications, offering a novel approximation for non-parametric clustering.

The paper tackles the scalability limitations of hierarchical agglomerative clustering by presenting a method that scales to billions of data points without sacrificing quality, achieving state-of-the-art results on benchmarks and demonstrating scalability on a dataset of 30 billion queries.

The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition, but also provide a two-approximation to non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes