ML LG May 28, 2025

A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning

Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux

arXiv:2505.22554v4 | 1 citation | h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses feature selection for diabetes risk prediction in public health and clinical medicine. It offers an incremental improvement over existing methods by ranking features on upper tail dependence rather than on average association.

The authors develop a supervised filter for feature selection in medical risk prediction. It ranks features by the Gumbel copula upper tail dependence coefficient, prioritizing features that are jointly extreme with the positive class rather than associated with it on average. On a large diabetes dataset (CDC, N=253,680), the method reduced the feature set by about 52% while retaining strong discrimination, and it significantly outperformed Mutual Information and mRMR. On a clinical benchmark (PIMA, N=768), it achieved the numerically highest ROC AUC, with no significant differences versus strong baselines.

Effective feature selection is vital for robust and interpretable medical prediction, especially for identifying risk factors concentrated in extreme patient strata. Standard methods emphasize average associations and may miss predictors whose importance lies in the tails of the distribution. We propose a computationally efficient supervised filter that ranks features using the Gumbel copula upper tail dependence coefficient ($\lambda_U$), prioritizing variables that are simultaneously extreme with the positive class. We benchmarked against Mutual Information, mRMR, ReliefF, and $L_1$ Elastic Net across four classifiers on two diabetes datasets: a large public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Evaluation included paired statistical tests, permutation importance, and robustness checks with label flips, feature noise, and missingness. On CDC, our method was the fastest selector and reduced the feature space by about 52% while retaining strong discrimination. Although using all 21 features yielded the highest AUC, our filter significantly outperformed Mutual Information and mRMR and was statistically indistinguishable from ReliefF. On PIMA, with only eight predictors, our ranking produced the numerically highest ROC AUC, and no significant differences were found versus strong baselines. Across both datasets, the upper tail criterion consistently identified clinically coherent, impactful predictors. We conclude that copula-based feature selection via upper tail dependence is a powerful, efficient, and interpretable approach for building risk models in public health and clinical medicine.
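
The ranking idea in the abstract can be illustrated with a minimal sketch. It assumes the Gumbel parameter is estimated by Kendall's tau inversion, $\theta = 1/(1-\tau)$, which gives $\lambda_U = 2 - 2^{1/\theta}$; the abstract does not state which estimator the authors use, so this is one plausible choice and not necessarily their implementation. The function names (`kendall_tau_b`, `gumbel_lambda_u`, `rank_features`) and the toy data are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b (tie-corrected), O(n^2) pairwise version."""
    conc = disc = tie_x = tie_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue              # tied in both: excluded from tau-b
        elif dx == 0:
            tie_x += 1            # tied in x only
        elif dy == 0:
            tie_y += 1            # tied in y only
        elif dx * dy > 0:
            conc += 1             # concordant pair
        else:
            disc += 1             # discordant pair
    denom = math.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
    return (conc - disc) / denom if denom else 0.0


def gumbel_lambda_u(tau):
    """Upper tail dependence of a Gumbel copula fitted by tau inversion.

    Assumption: theta = 1 / (1 - tau). The Gumbel family only models
    non-negative dependence (theta >= 1), so tau <= 0 maps to lambda_U = 0.
    """
    if tau <= 0.0:
        return 0.0
    if tau >= 1.0:
        return 1.0
    theta = 1.0 / (1.0 - tau)
    return 2.0 - 2.0 ** (1.0 / theta)


def rank_features(features, y, k=None):
    """Score each feature against the binary target and sort descending.

    `features` maps a feature name to its list of values (hypothetical layout).
    Returns a list of (name, lambda_U) pairs, truncated to the top k if given.
    """
    scores = {
        name: gumbel_lambda_u(kendall_tau_b(values, y))
        for name, values in features.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if k is None else ranked[:k]


# Toy data, invented for illustration only.
y = [0, 0, 0, 1, 1, 1]
features = {
    "noise": [3, 1, 2, 3, 1, 2],
    "strong": [1, 2, 3, 4, 5, 6],
}
ranking = rank_features(features, y)
```

Here `ranking` lists `strong` ahead of `noise`, since only `strong` rises together with the positive class. The pairwise tau computation is quadratic in the sample size, so a dataset on the scale of CDC would need a faster tau routine than this sketch uses.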
