ME ML · Oct 5, 2016

Non-Parametric Cluster Significance Testing with Reference to a Unimodal Null Distribution

arXiv:1610.01424v2 · 3 citations

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of validating clusters in fields such as genomics. The advance is incremental because it builds on existing work on cluster significance testing.

The paper tackles the problem of determining whether clusters identified in high-dimensional data represent distinct subgroups. It proposes a non-parametric method that compares the within-cluster sum of squares of the original data with that obtained by clustering a unimodal null distribution, and reports that the method accurately tests for the presence of clustering.

Cluster analysis is an unsupervised learning strategy that can be employed to identify subgroups of observations in data sets of unknown structure. This strategy is particularly useful for analyzing high-dimensional data such as microarray gene expression data. Many clustering methods are available, but it is challenging to determine if the identified clusters represent distinct subgroups. We propose a novel strategy to investigate the significance of identified clusters by comparing the within-cluster sum of squares from the original data to that produced by clustering an appropriate unimodal null distribution. The null distribution we present for this problem uses kernel density estimation and thus does not require that the data follow any particular distribution. We find that our method can accurately test for the presence of clustering even when the number of features is high.
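
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes k-means as the clustering method, a product Gaussian kernel, and a variance-preserving smoothed bootstrap as the way to sample from the kernel density estimate. The function names and the bandwidth multiplier `h` are hypothetical, and the sketch does not verify that the resulting null is actually unimodal; it only relies on a large bandwidth making that plausible.

```python
import numpy as np

def kmeans_wcss(X, k, rng, n_init=5, n_iter=100):
    """Smallest within-cluster sum of squares over n_init runs of Lloyd's algorithm."""
    best = np.inf
    for _ in range(n_init):
        # Start from k distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Squared Euclidean distance from every point to every center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute centers; an empty cluster keeps its previous center.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def sample_kde_null(X, h, rng):
    """Draw n points from a Gaussian KDE of X (smoothed bootstrap).

    Assumption: the per-feature bandwidth is h times the feature's standard
    deviation, and the sample is shrunk toward the mean so that each feature
    keeps roughly its original variance.
    """
    n, p = X.shape
    mean, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    idx = rng.integers(0, n, size=n)                     # resample observations
    noisy = X[idx] + h * sd * rng.standard_normal((n, p))  # add kernel noise
    return mean + (noisy - mean) / np.sqrt(1.0 + h ** 2)

def cluster_pvalue(X, k=2, h=1.5, n_null=99, seed=0):
    """Monte Carlo p-value: fraction of null data sets that cluster at least as tightly."""
    rng = np.random.default_rng(seed)
    observed = kmeans_wcss(X, k, rng)
    null = np.array([
        kmeans_wcss(sample_kde_null(X, h, rng), k, rng) for _ in range(n_null)
    ])
    return (1 + np.sum(null <= observed)) / (n_null + 1)
```

A small p-value means the observed within-cluster sum of squares is lower than what the smoothed, single-population null typically yields for the same number of clusters. Matching the variance of the null to that of the data keeps the two sums of squares on a comparable scale. The choice of `h` matters: too small a value leaves the null multimodal, and a principled choice would need to follow the paper itself.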
