ML LGNov 2, 2018

A Fast Algorithm for Clustering High Dimensional Feature Vectors

arXiv:1811.00956v12 citations

Originality Incremental advance

AI Analysis

This provides a fast and accurate clustering solution for high-dimensional data, such as in genomics, but is incremental as it builds on existing matrix-based approaches.

The paper tackles the problem of clustering high-dimensional data by proposing an algorithm that exploits the cluster-dependent structure of an N×N matrix, making it faster than other methods, especially when N << P. It demonstrates superior accuracy, outperforming 16 other algorithms on 32 genomic datasets more than twice as often as its closest competitors.

We propose an algorithm for clustering high dimensional data. If $P$ features for $N$ objects are represented in an $N\times P$ matrix ${\bf X}$, where $N\ll P$, the method is based on exploiting the cluster-dependent structure of the $N\times N$ matrix ${\bf XX}^T$. Computational burden thus depends primarily on $N$, the number of objects to be clustered, rather than $P$, the number of features that are measured. This makes the method particularly useful in high dimensional settings, where it is substantially faster than a number of other popular clustering algorithms. Aside from an upper bound on the number of potential clusters, the method is independent of tuning parameters. When compared to $16$ other clustering algorithms on $32$ genomic datasets with gold standards, we show that it provides the most accurate cluster configuration more than twice as often than its closest competitors. We illustrate the method on data taken from highly cited genomic studies.

View on arXiv PDF

Similar