Unsupervised Feature Selection to Identify Important ICD-10 Codes for Machine Learning: A Case Study on a Coronary Artery Disease Patient Cohort
This work addresses feature selection in large ICD code databases, a practical problem for healthcare researchers, but its contribution is incremental: it compares existing methods on a specific dataset.
The study addresses the challenge of selecting relevant ICD-10 codes from over 9,000 options for machine learning in healthcare. It compares unsupervised feature selection methods on a cohort of 49,075 coronary artery disease patients and finds that Concrete Autoencoder methods outperformed the others both in reconstructing the feature space and in predicting 90-day mortality.
Using International Classification of Diseases (ICD) codes as features for machine learning models in healthcare is challenging because the system contains so many codes that selecting the relevant ones is difficult. In this study, we compared several unsupervised feature selection methods on an ICD code database of 49,075 coronary artery disease patients in Alberta, Canada. Specifically, we employed Laplacian Score, Unsupervised Feature Selection for Multi-Cluster Data, Autoencoder Inspired Unsupervised Feature Selection, Principal Feature Analysis, and Concrete Autoencoders with and without ICD tree weight adjustment to select the 100 best features from over 9,000 codes. We assessed the selected features based on their ability to reconstruct the initial feature space and to predict 90-day mortality following discharge. Our findings revealed that the Concrete Autoencoder methods outperformed all other methods in both tasks. Furthermore, the weight adjustment in the Concrete Autoencoder method decreased the complexity of the selected features.
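To make one of the compared baselines concrete, the following is a minimal sketch of the Laplacian Score in its standard formulation (k-nearest-neighbour graph with heat-kernel weights, lower score is better). It is an illustration only, not the implementation used in the study: the function name `laplacian_score`, the parameters `n_neighbors` and `t`, and their default values are assumptions, and the dense n-by-n matrices used here would not scale to a cohort of 49,075 patients without sparse data structures.

```python
import numpy as np

def laplacian_score(X, n_neighbors=5, t=1.0):
    """Return one Laplacian Score per column of X (lower = better).

    X: (n_samples, n_features) array, e.g. a binary patient-by-code matrix.
    n_neighbors, t: illustrative defaults, not values taken from the study.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, np.inf)  # a sample is never its own neighbour

    # k-nearest-neighbour graph with heat-kernel edge weights.
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx.ravel()
    S = np.zeros((n, n))
    S[rows, cols] = np.exp(-d2[rows, cols] / t)
    S = np.maximum(S, S.T)  # symmetrise: keep an edge if either end chose it

    d = S.sum(axis=1)        # node degrees (diagonal of D)
    L = np.diag(d) - S       # graph Laplacian L = D - S

    # Remove the degree-weighted mean from every feature.
    Xc = X - (d @ X) / d.sum()

    # Score_r = (f_r^T L f_r) / (f_r^T D f_r), computed for all features at once.
    num = np.einsum("ij,ij->j", Xc, L @ Xc)
    den = np.einsum("ij,ij->j", Xc, d[:, None] * Xc)

    scores = np.full(X.shape[1], np.inf)  # constant features get the worst score
    ok = den > 1e-12
    scores[ok] = num[ok] / den[ok]
    return scores
```

Under these assumptions, selecting the 100 best codes would amount to `np.argsort(laplacian_score(X))[:100]`, since features that vary smoothly over the neighbourhood graph receive the lowest scores.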