LG CV ML

May 16, 2020

Revisiting Agglomerative Clustering

Eric K. Tokuda, Cesar H. Comin, Luciano da F. Costa

arXiv:2005.07995

Originality: Synthesis-oriented

AI Analysis

This work addresses the issue of false positives in clustering for data analysis, but it is incremental as it revisits and evaluates existing methods without introducing new techniques.

The study tackled the problem of false positives in agglomerative clustering by applying six linkage methods (single, average, median, complete, centroid and Ward's) to unimodal and bimodal datasets. It found that many methods incorrectly detect two clusters in unimodal data, and that single-linkage is more resilient to false positives.

An important issue in clustering concerns the avoidance of false positives while searching for clusters. This work addressed this problem considering agglomerative methods, namely single, average, median, complete, centroid and Ward's approaches, applied to unimodal and bimodal datasets obeying uniform, Gaussian, exponential and power-law distributions. A model of clusters was also adopted, involving a higher density nucleus surrounded by a transition, followed by outliers. This paved the way to defining an objective means for identifying the clusters from dendrograms. The adopted model also allowed the relevance of the clusters to be quantified in terms of the height of their subtrees. The obtained results include the verification that many methods detect two clusters in unimodal data. The single-linkage method was found to be more resilient to false positives. Also, several methods detected clusters not corresponding directly to the nucleus. The possibility of identifying the type of distribution was also investigated.
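The kind of experiment described in the abstract can be sketched roughly as follows. This is a minimal illustration, assuming SciPy's `scipy.cluster.hierarchy.linkage` as the implementation of the six methods. The `top_split_balance` helper and its "balance" statistic are hypothetical and are not the paper's cluster-identification criterion: they only report how evenly the final merge of the dendrogram divides a unimodal Gaussian sample. The sample size and seed are likewise arbitrary choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The six agglomerative methods named in the abstract, by their SciPy names.
METHODS = ["single", "average", "median", "complete", "centroid", "ward"]


def top_split_balance(points, method):
    """Fraction of points in the smaller of the two subtrees joined last.

    Hypothetical proxy (not the paper's criterion): a value near 0.5 means
    the dendrogram root splits the data into two sizable groups, while a
    value near 0 means the last merge only attaches a few outlying points.
    """
    n = points.shape[0]
    Z = linkage(points, method=method)  # Euclidean distance by default

    def subtree_size(idx):
        # Indices below n are original observations; larger ones refer to
        # the cluster created in row (idx - n), whose size is in column 3.
        return 1 if idx < n else int(Z[idx - n, 3])

    left, right = int(Z[-1, 0]), int(Z[-1, 1])
    return min(subtree_size(left), subtree_size(right)) / n


# Unimodal data: a single 2D Gaussian, so any two-cluster split is spurious.
rng = np.random.default_rng(0)
unimodal = rng.normal(size=(200, 2))

balance = {m: top_split_balance(unimodal, m) for m in METHODS}
for method, value in balance.items():
    print(f"{method:>8}: smaller top-level subtree holds {value:.1%} of points")
```

Comparing the printed fractions across methods gives a rough feel for which linkages tend to carve unimodal data into two large groups. A faithful reproduction would instead use the nucleus/transition/outlier cluster model and the subtree-height relevance measure described in the abstract.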
