LGDSMLDec 31, 2019

Scalable Hierarchical Clustering with Tree Grafting

arXiv:2001.00076v141 citations
AI Analysis

This provides a scalable solution for hierarchical clustering in domains like author coreference, though it is incremental as it builds on linkage functions with new guarantees.

The paper tackles the problem of large-scale hierarchical clustering by introducing Grinch, an algorithm that efficiently reconfigures hierarchies as new points arrive, and proves it can recover ground-truth clusters under separability conditions, achieving higher accuracy and being orders of magnitude faster than existing methods.

We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical clustering with general linkage functions that compute arbitrary similarity between two point sets. The key components of Grinch are its rotate and graft subroutines that efficiently reconfigure the hierarchy as new points arrive, supporting discovery of clusters with complex structure. Grinch is motivated by a new notion of separability for clustering with linkage functions: we prove that when the model is consistent with a ground-truth clustering, Grinch is guaranteed to produce a cluster tree containing the ground-truth, independent of data arrival order. Our empirical results on benchmark and author coreference datasets (with standard and learned linkage functions) show that Grinch is more accurate than other scalable methods, and orders of magnitude faster than hierarchical agglomerative clustering.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes