STML · Jun 9, 2014

Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures

arXiv:1406.2206v1 · 26 citations
Originality: Incremental advance
AI Analysis

This paper addresses clustering of high-dimensional data, with applications such as genomics and image analysis, and offers an incremental improvement by relaxing assumptions on cluster shape and mean separation.

The paper tackles clustering in high-dimensional Gaussian mixture models with non-spherical clusters, where only a few dimensions are relevant, by combining Gaussian mixture parameter learning with sparse LDA to estimate cluster assignments and relevant features. The results show that sample complexity depends on sparsity and scales logarithmically with dimension, requiring milder assumptions than prior work.

We consider the problem of clustering data points in high dimensions, i.e., when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for learning parameters of a Gaussian mixture model and sparse linear discriminant analysis (LDA). In addition to cluster assignments, the method returns an estimate of the set of features relevant for clustering. Our results indicate that the sample complexity of clustering depends on the sparsity of the relevant feature set, while only scaling logarithmically with the ambient dimension. Additionally, we require much milder assumptions than existing work on clustering in high dimensions. In particular, we require neither spherical clusters nor mean separation along the relevant dimensions.
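The last claim of the abstract, that a relevant dimension need not show mean separation, can be illustrated with a small sketch. It assumes two components that share one covariance matrix, which is the standard LDA setting but is not stated in the abstract. Under that assumption the optimal linear discriminant direction is the covariance inverse applied to the mean difference, and the relevant features are its non-zero coordinates. The function names, the tolerance, and the toy numbers are hypothetical, and the code works with known population parameters rather than the paper's estimator.

```python
import numpy as np

def lda_direction(mu1, mu2, cov):
    """LDA discriminant direction: solves cov @ beta = mu1 - mu2."""
    return np.linalg.solve(cov, mu1 - mu2)

def relevant_features(beta, tol=1e-10):
    """Indices of coordinates of beta that are non-zero up to tol."""
    return np.flatnonzero(np.abs(beta) > tol)

# Toy 4-dimensional mixture (hypothetical numbers).
# The two means differ only in feature 0.
mu1 = np.array([1.0, 0.0, 0.0, 0.0])
mu2 = np.array([-1.0, 0.0, 0.0, 0.0])

# Shared, non-spherical covariance: features 0 and 1 are correlated.
cov = np.eye(4)
cov[0, 1] = cov[1, 0] = 0.5

beta = lda_direction(mu1, mu2, cov)
support = relevant_features(beta)

print("mean difference:", mu1 - mu2)
print("discriminant direction:", beta)
print("relevant features:", support)
```

Here the relevant set is features 0 and 1, although the means are identical along feature 1: its correlation with feature 0 makes it informative for separating the clusters. Selecting features by mean difference alone would miss it, which is why a sparse LDA step targets the support of the discriminant direction.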

