ST LG OC MLApr 6, 2015

A Probabilistic $\ell_1$ Method for Clustering High Dimensional Data

arXiv:1504.01294v2

Originality Incremental advance

AI Analysis

This addresses the challenge of unreliable distances in high-dimensional spaces for clustering, but it appears incremental as it builds on existing distance-based methods with a specific metric adaptation.

The paper tackles the problem of clustering high-dimensional data by proposing a distance-based iterative method using the ℓ1-metric, which is less sensitive to high dimensionality than Euclidean distance, and reports that its performance improves significantly as dimension increases.

In general, the clustering problem is NP-hard, and global optimality cannot be established for non-trivial instances. For high-dimensional data, distance-based methods for clustering or classification face an additional difficulty, the unreliability of distances in very high-dimensional spaces. We propose a distance-based iterative method for clustering data in very high-dimensional space, using the $\ell_1$-metric that is less sensitive to high dimensionality than the Euclidean distance. For $K$ clusters in $\mathbb{R}^n$, the problem decomposes to $K$ problems coupled by probabilities, and an iteration reduces to finding $Kn$ weighted medians of points on a line. The complexity of the algorithm is linear in the dimension of the data space, and its performance was observed to improve significantly as the dimension increases.

View on arXiv PDF

Similar