Variable Selection Methods for Model-based Clustering
This is an incremental review that addresses the problem of dimensionality and interpretability in clustering for researchers and practitioners in fields using multivariate data.
The paper reviews variable selection techniques for model-based clustering to handle high-dimensional data and improve interpretability, summarizing existing methods and illustrating their application with R packages on two data analysis examples.
Model-based clustering is a popular approach for clustering multivariate data and has been applied in numerous fields. High-dimensional data are increasingly common, and the model-based clustering approach has been adapted to cope with the growing dimensionality. In particular, the development of variable selection techniques has received considerable attention in recent years. Even for small problems, variable selection has been advocated as a way to facilitate interpretation of the clustering results. This review summarizes the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated, and their use is illustrated on two data analysis examples.