AI · Jun 13, 2022
A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative
Elena Casiraghi, Rachel Wong, Margaret Hall et al.
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and the data-related modelling choices are both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to gain an understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.
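As a rough illustration of how results from several imputed datasets are usually combined in multiple imputation, the sketch below pools one regression coefficient with Rubin's rules. The abstract does not state which pooling procedure the authors used, so this is an assumption, and the function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: point estimates of the coefficient, one per imputed dataset.
    variances: squared standard errors of those estimates.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    within = mean(variances)     # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical estimates from m = 3 imputed datasets (not taken from the paper).
estimates = [0.50, 0.54, 0.46]
variances = [0.010, 0.012, 0.011]
pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, total variance = {total_var:.5f}")
```

The between-imputation term grows when the imputed datasets disagree, so the pooled standard error reflects the uncertainty introduced by the missing values.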
CL · Aug 9, 2022
A Multimodal Transformer: Fusing Clinical Notes with Structured EHR Data for Interpretable In-Hospital Mortality Prediction
Weimin Lyu, Xinyu Dong, Rachel Wong et al.
Deep-learning-based clinical decision support using structured electronic health records (EHR) has been an active research area for predicting risks of mortality and diseases. Meanwhile, large amounts of narrative clinical notes provide complementary information, but are often not integrated into predictive models. In this paper, we provide a novel multimodal transformer to fuse clinical notes and structured EHR data for better prediction of in-hospital mortality. To improve interpretability, we propose an integrated gradients (IG) method to select important words in clinical notes and discover the critical structured EHR features with Shapley values. These important words and clinical features are visualized to assist with interpretation of the prediction outcomes. We also investigate the significance of domain adaptive pretraining and task adaptive fine-tuning on Clinical BERT, which is used to learn the representations of clinical notes. Experiments demonstrated that our model outperforms other methods (AUCPR: 0.538, AUCROC: 0.877, F1: 0.490).
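To make the integrated gradients idea concrete, here is a minimal sketch on a toy logistic model rather than on the multimodal transformer described above. The weights, inputs, all-zero baseline, and function names are hypothetical, and the path integral is approximated with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy logistic model standing in for a real network."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_grad(x, weights, bias):
    """Analytic gradient of the logistic model with respect to its input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate IG attributions along the straight path baseline -> x."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(point, weights, bias)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical model and input (not taken from the paper).
weights, bias = [1.5, -2.0, 0.5], 0.1
x, baseline = [1.0, 0.5, 2.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `model(x) - model(baseline)`.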
LG · May 9, 2023
Enhancing Clinical Predictive Modeling through Model Complexity-Driven Class Proportion Tuning for Class Imbalanced Data: An Empirical Study on Opioid Overdose Prediction
Yinan Liu, Xinyu Dong, Weimin Lyu et al.
Class imbalance problems are widespread in the medical field and heavily deteriorate the performance of clinical predictive models. Most techniques to alleviate the problem rebalance the class proportions, and they predominantly assume that the rebalanced proportions should be a function of the original data and oblivious to the model one uses. This work challenges this prevailing assumption and proposes a theoretical framework that links the optimal class proportions to the model complexity, thereby tuning the class proportions per model. Our experiments on the opioid overdose prediction problem highlight the performance gain of tuning class proportions. Rigorous regression analysis also confirms the advantages of the proposed theoretical framework and the statistically significant correlation between the hyperparameters controlling the model complexity and the optimal class proportions.
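One simple way to treat the class proportion as a tunable quantity is to undersample the majority class to a chosen positive-class proportion and then compare candidate proportions per model on validation data. The sketch below shows only the resampling step; it is not the authors' procedure, and the helper name `undersample_to_proportion` and the toy labels are hypothetical.

```python
import random


def undersample_to_proportion(labels, target_pos, seed=0):
    """Return row indices whose positive-class share is about target_pos.

    All positives (label 1) are kept; negatives (label 0) are randomly
    dropped. If there are too few negatives, all of them are kept.
    """
    if not 0.0 < target_pos < 1.0:
        raise ValueError("target_pos must be strictly between 0 and 1")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_neg = min(len(neg), round(len(pos) * (1.0 - target_pos) / target_pos))
    rng = random.Random(seed)
    kept = pos + rng.sample(neg, n_neg)
    rng.shuffle(kept)
    return kept


# Toy labels with a 5% positive rate (not taken from the paper).
labels = [1] * 50 + [0] * 950
kept = undersample_to_proportion(labels, target_pos=0.25)
share = sum(labels[i] for i in kept) / len(kept)
print(len(kept), share)
```

In a tuning loop, `target_pos` would be one more hyperparameter: each candidate value is used to resample the training set, the model is refit, and the value with the best validation score for that particular model is retained.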