ME LG ML, Sep 26, 2019

CS Sparse K-means: An Algorithm for Cluster-Specific Feature Selection in High-Dimensional Clustering

arXiv:1909.12384v22.32 citations

Originality: Incremental advance

AI Analysis

The paper addresses a gap in high-dimensional clustering, particularly in genomics, by selecting features according to which clusters they actually separate. The advance is incremental, since it builds on existing Sparse K-means methods.

The paper tackles the problem of cluster-specific feature selection in high-dimensional clustering, where features may only separate subsets of clusters, and proposes a K-means based algorithm that identifies informative features and the cluster pairs they separate, demonstrating results on simulated and leukemia gene expression data.

Feature selection is an important and challenging task in high-dimensional clustering. For example, in genomics, only a small number of genes may be differentially expressed and therefore informative about the overall clustering structure. Existing feature selection methods, such as Sparse K-means, rarely account for features that can separate only a subset of clusters. In genomics, it is highly likely that a gene distinguishes one subtype from all the others, or distinguishes a single pair of subtypes but no others. In this paper, we propose a K-means based clustering algorithm that discovers informative features as well as the cluster pairs that each selected feature separates. The method is essentially an EM algorithm, in which we introduce lasso-type constraints on each cluster pair in the M step, and make the E step possible by maximizing the raw cross-cluster distance instead of minimizing the intra-cluster distance. We demonstrate the method on simulated data and a leukemia gene expression dataset.
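The idea of lasso-type constraints on each cluster pair can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes fixed cluster labels and uses the squared difference of cluster means as a stand-in for the cross-cluster distance. For each pair it applies the soft-thresholding weight update known from standard Sparse K-means: maximize w·a subject to ||w||_2 <= 1, ||w||_1 <= s, w >= 0. The function names, the bound `s`, and the toy data are all hypothetical.

```python
import numpy as np

def l1_constrained_weights(a, s, n_iter=60):
    """Maximize w.a subject to ||w||_2 <= 1, ||w||_1 <= s, w >= 0.

    Solved by soft-thresholding `a` at a level found by bisection,
    as in the standard Sparse K-means weight update. Requires s >= 1.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    a = np.maximum(np.asarray(a, dtype=float), 0.0)
    if not np.any(a > 0):
        return np.zeros_like(a)
    w = a / np.linalg.norm(a)
    if w.sum() <= s:            # L1 bound already satisfied, no thresholding
        return w
    lo, hi = 0.0, a.max()
    for _ in range(n_iter):     # bisection on the threshold level
        mid = 0.5 * (lo + hi)
        t = np.maximum(a - mid, 0.0)
        if (t / np.linalg.norm(t)).sum() > s:
            lo = mid            # still too dense: raise the threshold
        else:
            hi = mid            # feasible: try a smaller threshold
    t = np.maximum(a - hi, 0.0)
    if np.linalg.norm(t) == 0:  # guard for exact ties at the maximum
        t = np.maximum(a - lo, 0.0)
    return t / np.linalg.norm(t)

def pairwise_feature_weights(X, labels, s):
    """One sparse weight vector per cluster pair (assumed separation score:
    squared difference of the two cluster means, per feature)."""
    clusters = np.unique(labels)
    means = {k.item(): X[labels == k].mean(axis=0) for k in clusters}
    weights = {}
    for i, k in enumerate(clusters):
        for l in clusters[i + 1:]:
            a = (means[k.item()] - means[l.item()]) ** 2
            weights[(k.item(), l.item())] = l1_constrained_weights(a, s)
    return weights

# Toy data: feature 0 separates cluster 0 from the rest,
# feature 1 separates cluster 2 from the rest, features 2-5 are noise.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.1, size=(90, 6))
labels = np.repeat([0, 1, 2], 30)
X[labels == 0, 0] += 5.0
X[labels == 2, 1] += 5.0

for pair, w in pairwise_feature_weights(X, labels, s=1.2).items():
    print(pair, np.round(w, 3))
```

In this toy setup each pair gets its own weight vector, so a feature can be selected for one pair and ignored for another. A full method would also alternate this step with a cluster reassignment step, which the sketch omits.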

