LG Jul 6, 2022

Ensemble feature selection with clustering for analysis of high-dimensional, correlated clinical data in the search for Alzheimer's disease biomarkers

Annette Spooner, Gelareh Mohammadi, Perminder S. Sachdev, Henry Brodaty, Arcot Sowmya

arXiv:2207.02380v11.81 citations h-index: 128

Originality: Incremental advance

AI Analysis

This work addresses the challenge of identifying reliable biomarkers for Alzheimer's disease from noisy clinical data, representing an incremental improvement in feature selection techniques for healthcare applications.

The authors tackled the problem of unstable feature selection in high-dimensional clinical data with correlated features by introducing a novel ensemble framework that incorporates clustering to mitigate biases. Their method showed marked improvement in feature selection stability and identified features consistent with Alzheimer's disease literature.

Healthcare datasets often contain groups of highly correlated features, such as features from the same biological system. When feature selection is applied to these datasets to identify the most important features, the biases that correlated features induce in some multivariate feature selectors make it difficult for these methods to distinguish important features from irrelevant ones, and the results of the feature selection process can be unstable. Feature selection ensembles, which aggregate the results of multiple individual base feature selectors, have been investigated as a means of stabilising feature selection results, but they do not address the problem of correlated features. We present a novel framework to create feature selection ensembles from multivariate feature selectors while taking into account the biases produced by groups of correlated features, using agglomerative hierarchical clustering in a pre-processing step. These methods were applied to two real-world datasets from studies of Alzheimer's disease (AD), a progressive neurodegenerative disease that has no cure and is not yet fully understood. Our results show a marked improvement in the stability of the selected features compared with models without clustering, and the features selected by these models are in keeping with the findings in the AD literature.
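
The general idea of clustering correlated features before running a feature selection ensemble can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the distance 1 - |Pearson r|, average linkage, the 0.3 threshold, the bootstrap voting scheme, the function names, and the synthetic data are all assumptions made for illustration. In particular, the base selector here is a simple univariate correlation ranking, swapped in for the multivariate selectors the abstract refers to.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def cluster_features(columns, threshold=0.3):
    """Average-linkage agglomerative clustering of features.

    Distance between two features is 1 - |r| (an assumed choice). Merging
    stops once the closest pair of clusters is farther apart than threshold.
    """
    names = list(columns)
    dist = {(a, b): 1 - abs(pearson(columns[a], columns[b]))
            for a in names for b in names if a != b}

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[n] for n in names]
    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters


def ensemble_select(columns, target, clusters, k=2, n_rounds=50, seed=0):
    """Bootstrap ensemble over clusters instead of individual features.

    Each round resamples the rows, scores every cluster by its best
    |correlation with the target| (a stand-in base selector), and votes for
    the top k clusters. Returns the selection frequency of each cluster.
    """
    rng = random.Random(seed)
    n = len(target)
    votes = [0] * len(clusters)
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [target[i] for i in idx]
        scores = []
        for c, members in enumerate(clusters):
            best = max(abs(pearson([columns[m][i] for i in idx], y))
                       for m in members)
            scores.append((best, c))
        for _, c in sorted(scores, reverse=True)[:k]:
            votes[c] += 1
    return {tuple(m): votes[c] / n_rounds for c, m in enumerate(clusters)}


# Hypothetical synthetic data: x1 and x2 are near-duplicates, x4 is noise.
rng = random.Random(42)
a = [rng.gauss(0, 1) for _ in range(200)]
b = [rng.gauss(0, 1) for _ in range(200)]
columns = {
    "x1": a,
    "x2": [v + rng.gauss(0, 0.1) for v in a],
    "x3": b,
    "x4": [rng.gauss(0, 1) for _ in range(200)],
}
target = [u + 0.5 * v + rng.gauss(0, 0.1) for u, v in zip(a, b)]

clusters = cluster_features(columns)
frequency = ensemble_select(columns, target, clusters)
print(clusters)
print(frequency)
```

Because votes are cast for clusters rather than for individual features, two near-duplicate features such as x1 and x2 no longer compete with each other across bootstrap rounds, which is one plausible way clustering could stabilise the selected set. A real analysis would more likely use an established library for the clustering step and a proper stability measure for evaluation.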

