LG · Oct 12, 2020

The Impact of Isolation Kernel on Agglomerative Hierarchical Clustering Algorithms

arXiv:2010.05473v1 · 22 citations
AI Analysis

This paper addresses a specific bottleneck in clustering algorithms for data analysis, but it appears incremental because it modifies existing methods rather than introducing a new paradigm.

The paper tackles the issue of agglomerative hierarchical clustering (AHC) methods struggling to identify adjacent clusters with varied densities, and shows that using a data-dependent kernel, specifically Isolation Kernel, improves dendrogram quality compared to distance and other kernels.

Agglomerative hierarchical clustering (AHC) is a popular clustering approach. Existing AHC methods, which are based on a distance measure, share one key issue: they have difficulty identifying adjacent clusters with varied densities, regardless of the cluster extraction method applied to the resultant dendrogram. In this paper, we identify the root cause of this issue and show that the use of a data-dependent kernel (instead of a distance or an existing kernel) provides an effective means to address it. We analyse the condition under which existing AHC methods fail to extract clusters effectively, and the reason why the data-dependent kernel is an effective remedy. This leads to a new approach to kernelise existing hierarchical clustering algorithms such as traditional AHC algorithms, HDBSCAN, GDL and PHA. For each of these algorithms, our empirical evaluation shows that the recently introduced Isolation Kernel produces a higher-quality or purer dendrogram than distance, Gaussian Kernel and adaptive Gaussian Kernel.
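To make the idea of kernelising AHC concrete, here is a minimal sketch, not the authors' implementation. It assumes the nearest-neighbour (Voronoi cell) formulation of Isolation Kernel, in which the similarity of two points is the fraction of random partitionings that place them in the same cell, and it pairs that with plain average linkage. The function names, the parameters `psi` (sample size per partitioning) and `t` (number of partitionings), and the toy data are illustrative assumptions, not details taken from this page.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def isolation_kernel_matrix(points, psi=4, t=200, seed=0):
    """Estimate a data-dependent similarity matrix (assumed Voronoi-cell variant).

    In each of t rounds, psi points are sampled as cell centres and every
    point is assigned to its nearest centre. K[i][j] is the fraction of
    rounds in which points i and j share a cell.
    """
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(t):
        centres = rng.sample(points, psi)  # requires psi <= len(points)
        cells = [
            min(range(psi), key=lambda c: sq_dist(p, centres[c]))
            for p in points
        ]
        for i in range(n):
            for j in range(n):
                if cells[i] == cells[j]:
                    counts[i][j] += 1
    return [[c / t for c in row] for row in counts]


def average_linkage_merges(sim):
    """Agglomerative clustering on a similarity matrix.

    Repeatedly merges the pair of clusters with the highest average
    pairwise similarity. Returns a list of (id_a, id_b, similarity)
    merges; merged clusters receive new ids n, n+1, ...
    """
    clusters = {i: [i] for i in range(len(sim))}
    merges = []
    next_id = len(sim)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                members_a, members_b = clusters[a], clusters[b]
                total = sum(sim[i][j] for i in members_a for j in members_b)
                avg = total / (len(members_a) * len(members_b))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        avg, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, avg))
        next_id += 1
    return merges


# Toy data (hypothetical): a dense cluster next to a sparse one.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
sparse = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0), (2.5, 0.5), (1.5, -0.5)]
points = dense + sparse

K = isolation_kernel_matrix(points, psi=4, t=200, seed=0)
for a, b, s in average_linkage_merges(K):
    print(f"merge {a} + {b} at similarity {s:.2f}")
```

Because the cell boundaries depend on where the sampled points fall, the similarity adapts to local density, which is the property the abstract credits for the improved dendrograms. The sketch runs in quadratic time or worse, so it is suited only to small illustrative datasets.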

