Hierarchical Clustering better than Average-Linkage
This work addresses a theoretical bottleneck in hierarchical clustering for researchers and practitioners, providing improved algorithms for specific objectives, though it is incremental relative to prior work.
The paper tackled the problem of hierarchical clustering by analyzing two specific objectives and showing that average-linkage algorithms cannot achieve better than 1/3 and 2/3 approximations in worst-case scenarios, matching random solutions. It introduced two new algorithms based on semidefinite programming that provably outperform average-linkage for these objectives.
Hierarchical Clustering (HC) is a widely studied problem in exploratory data analysis, usually tackled by simple agglomerative procedures like average-linkage, single-linkage or complete-linkage. In this paper we focus on two objectives, introduced recently to give insight into the performance of average-linkage clustering: a similarity based HC objective proposed by [Moseley and Wang, 2017] and a dissimilarity based HC objective proposed by [Cohen-Addad et al., 2018]. In both cases, we present tight counterexamples showing that average-linkage cannot obtain better than 1/3 and 2/3 approximations respectively (in the worst-case), settling an open question raised in [Moseley and Wang, 2017]. This matches the approximation ratio of a random solution, raising a natural question: can we beat average-linkage for these objectives? We answer this in the affirmative, giving two new algorithms based on semidefinite programming with provably better guarantees.