DSAICVLGJun 10, 2021

Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time

arXiv:2106.05610v130 citations
Originality Incremental advance
AI Analysis

This work addresses the computational bottleneck in graph clustering for data scientists, offering significant speed improvements but is incremental as it builds on existing HAC methods.

The paper tackles the problem of hierarchical agglomerative clustering on graphs by developing efficient algorithms, achieving nearly-linear time for exact solutions with certain linkage measures and subquadratic time for average-linkage, with experimental speed-ups of 20.7–76.5x on real datasets.

We study the widely used hierarchical agglomerative clustering (HAC) algorithm on edge-weighted graphs. We define an algorithmic framework for hierarchical agglomerative graph clustering that provides the first efficient $\tilde{O}(m)$ time exact algorithms for classic linkage measures, such as complete- and WPGMA-linkage, as well as other measures. Furthermore, for average-linkage, arguably the most popular variant of HAC, we provide an algorithm that runs in $\tilde{O}(n\sqrt{m})$ time. For this variant, this is the first exact algorithm that runs in subquadratic time, as long as $m=n^{2-ε}$ for some constant $ε> 0$. We complement this result with a simple $ε$-close approximation algorithm for average-linkage in our framework that runs in $\tilde{O}(m)$ time. As an application of our algorithms, we consider clustering points in a metric space by first using $k$-NN to generate a graph from the point set, and then running our algorithms on the resulting weighted graph. We validate the performance of our algorithms on publicly available datasets, and show that our approach can speed up clustering of point datasets by a factor of 20.7--76.5x.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes