MLLGMay 24, 2023

Hierarchical clustering with dot products recovers hidden tree structure

arXiv:2305.15022v32 citations
Originality Highly original
AI Analysis

This work addresses the challenge of accurately uncovering hidden tree structures in data for applications in fields like bioinformatics or data analysis, offering a novel method that improves upon standard clustering techniques.

The paper tackles the problem of recovering hidden hierarchical structure in data by proposing a variant of agglomerative clustering that merges clusters based on maximum average dot product instead of traditional metrics like distance or variance. It demonstrates that this method provides a valid estimate of generative hierarchical structure under a probabilistic graphical model and shows superior tree recovery performance over existing approaches such as UPGMA, Ward's method, and HDBSCAN on real data.

In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes