ML · LG · Feb 28, 2018

Automatic topography of high-dimensional data sets by non-parametric Density Peak clustering

arXiv:1802.10549v2 · 70 citations
Originality: Incremental advance
AI Analysis

The method gives data scientists a visual, automated tool for understanding the structure of high-dimensional data. It is rated an incremental advance because it builds on existing Density Peak clustering methods.

The paper addresses the analysis of high-dimensional data by introducing a method that automatically generates a topography of the probability density. The method identifies the density peaks, the valleys separating them, and their hierarchical organization, and it uses an error estimate on the density to distinguish genuine features from sampling noise. The authors present the approach as a robust extension of standard clustering partitions for complex data sets.

Data analysis in high-dimensional spaces aims at obtaining a synthetic description of a data set, revealing its main structure and its salient features. We here introduce an approach providing this description in the form of a topography of the data, namely a human-readable chart of the probability density from which the data are harvested. The approach is based on an unsupervised extension of Density Peak clustering and a non-parametric density estimator that measures the probability density in the manifold containing the data. This allows finding automatically the number and the height of the peaks of the probability density, and the depth of the "valleys" separating them. Importantly, the density estimator provides a measure of the error, which allows distinguishing genuine density peaks from density fluctuations due to finite sampling. The approach thus provides robust and visual information about the density peaks' height, their statistical reliability, and their hierarchical organization, offering a conceptually powerful extension of the standard clustering partitions. We show that this framework is particularly useful in the analysis of complex data sets.
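The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it swaps in a plain k-nearest-neighbour density estimate for the paper's non-parametric estimator, uses the embedding dimension instead of the dimension of the data manifold, and assumes an approximate error of 1/sqrt(k) on the log-density. The function names (`knn_log_density`, `find_peaks`, `is_significant`) and the confidence parameter `z` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def knn_log_density(X, k):
    """k-NN estimate of the log-density, with a rough error on it.

    Assumptions of this sketch: the embedding dimension d stands in for the
    manifold dimension, and the error on log(rho) is taken as ~1/sqrt(k).
    """
    n, d = X.shape
    # Pairwise Euclidean distances (O(n^2) memory, fine for a small example).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(dist, axis=1)
    neighbors = order[:, 1:k + 1]  # column 0 is the point itself
    r_k = dist[np.arange(n), neighbors[:, -1]]  # distance to the k-th neighbour
    # log(rho) up to an additive constant (the unit-ball volume is dropped).
    log_rho = np.log(k) - np.log(n) - d * np.log(r_k)
    err = np.full(n, 1.0 / np.sqrt(k))
    return log_rho, err, neighbors


def find_peaks(log_rho, neighbors):
    """Indices of points at least as dense as all of their k nearest neighbours."""
    return np.flatnonzero(log_rho >= log_rho[neighbors].max(axis=1))


def is_significant(log_rho_peak, log_rho_saddle, err_peak, err_saddle, z=1.0):
    """True if the peak rises above the saddle by more than z combined errors."""
    return (log_rho_peak - log_rho_saddle) > z * (err_peak + err_saddle)


# Usage: two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(8.0, 1.0, (100, 2))])
log_rho, err, neighbors = knn_log_density(X, k=10)
peaks = find_peaks(log_rho, neighbors)
print("candidate peaks:", len(peaks))
```

Candidate peaks found this way usually include spurious local maxima caused by finite sampling. A test like `is_significant`, applied to each peak and the saddle point separating it from a neighbouring peak, is one way to decide which peaks to keep and which to merge; finding the saddle points is left out of this sketch.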

Code Implementations: 2 repos
