LG AP | Nov 8, 2022

Significance-Based Categorical Data Clustering

Lianyu Hu, Mudi Jiang, Yan Liu, Zengyou He

arXiv:2211.03956v1 | 5.84 citations | h-index: 31 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a gap in statistical validation for categorical clustering, though its methodology appears incremental.

The paper tackles the problem of assessing statistical significance in categorical data clustering by developing a likelihood ratio test statistic as an objective function, which achieves comparable performance to state-of-the-art methods and improves cluster validation and number estimation.

Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical $p$-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method is able to achieve comparable performance to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of such a significance-based formulation on statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.
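
To make the idea of a likelihood ratio statistic with an empirical $p$-value concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact statistic or null model, so the sketch assumes a standard G-type likelihood ratio statistic (cluster-specific value frequencies against pooled frequencies, summed over attributes) and a null distribution obtained by shuffling the labels of a fixed partition. The names `lr_statistic` and `empirical_p_value` are hypothetical.

```python
import math
import random
from collections import Counter


def lr_statistic(data, labels):
    """G-type likelihood ratio statistic for a partition of categorical rows.

    Assumed form (not taken from the paper): for every attribute j,
    2 * sum over (cluster k, value v) of n_kjv * log(n_kjv / e_kjv),
    where e_kjv = n_k * n_jv / n is the count expected if cluster
    membership and the attribute value were independent.
    """
    n = len(data)
    n_attrs = len(data[0])
    cluster_sizes = Counter(labels)
    stat = 0.0
    for j in range(n_attrs):
        pooled = Counter(row[j] for row in data)
        joint = Counter((lab, row[j]) for row, lab in zip(data, labels))
        for (lab, val), count in joint.items():
            expected = cluster_sizes[lab] * pooled[val] / n
            stat += 2.0 * count * math.log(count / expected)
    return stat


def empirical_p_value(data, labels, n_perm=200, seed=0):
    """Monte Carlo p-value: share of label shuffles scoring at least as high.

    Simplification: the partition is treated as fixed. A faithful null for
    clustering would re-run the clustering on each null data set.
    """
    rng = random.Random(seed)
    observed = lr_statistic(data, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if lr_statistic(data, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: two perfectly separated groups over two categorical attributes.
toy_data = [("a", "x")] * 10 + [("b", "y")] * 10
toy_labels = [0] * 10 + [1] * 10
toy_stat = lr_statistic(toy_data, toy_labels)
toy_p = empirical_p_value(toy_data, toy_labels)
```

Because the labels in a real clustering run are chosen to maximize the statistic, shuffling them gives an optimistic $p$-value; the sketch only illustrates how an empirical $p$-value is read off a Monte Carlo null distribution.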
