Benjamin Shickel

LG
h-index57
26papers
1,858citations
Novelty32%
AI Score46

26 Papers

AIJun 30, 2023
Transformers in Healthcare: A Survey

Subhash Nerella, Sabyasachi Bandyopadhyay, Jiaqing Zhang et al.

With Artificial Intelligence (AI) increasingly permeating various aspects of society, including healthcare, the adoption of the Transformers neural network architecture is rapidly changing many applications. Transformer is a type of deep learning architecture initially developed to solve general-purpose Natural Language Processing (NLP) tasks and has subsequently been adapted in many fields, including healthcare. In this survey paper, we provide an overview of how this architecture has been adopted to analyze various forms of data, including medical imaging, structured and unstructured Electronic Health Records (EHR), social media, physiological signals, and biomolecular sequences. Those models could help in clinical diagnosis, report generation, data reconstruction, and drug/protein synthesis. We identified relevant studies using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We also discuss the benefits and limitations of using transformers in healthcare and examine issues such as computational cost, model interpretability, fairness, alignment with human values, ethical implications, and environmental impact.

CYJul 10, 2024
Promoting AI Competencies for Medical Students: A Scoping Review on Frameworks, Programs, and Tools

Yingbo Ma, Yukyeong Song, Jeremy A. Balch et al.

As more clinical workflows continue to be augmented by artificial intelligence (AI), AI literacy among physicians will become a critical requirement for ensuring safe and ethical AI-enabled patient care. Despite the evolving importance of AI in healthcare, the extent to which it has been adopted into traditional and often-overloaded medical curricula is currently unknown. In a scoping review of 1,699 articles published between January 2016 and June 2024, we identified 18 studies which propose guiding frameworks, and 11 studies documenting real-world instruction, centered around the integration of AI into medical education. We found that comprehensive guidelines will require greater clinical relevance and personalization to suit medical student interests and career trajectories. Current efforts highlight discrepancies in the teaching guidelines, emphasizing AI evaluation and ethics over technical topics such as data science and coding. Additionally, we identified several challenges associated with integrating AI training into the medical education program, including a lack of guidelines to define medical students AI literacy, a perceived lack of proven clinical value, and a scarcity of qualified instructors. With this knowledge, we propose an AI literacy framework to define competencies for medical students. To prioritize relevant and personalized AI education, we categorize literacy into four dimensions: Foundational, Practical, Experimental, and Ethical, with tailored learning objectives to the pre-clinical, clinical, and clinical research stages of medical education. This review provides a road map for developing practical and relevant education strategies for building an AI-competent healthcare workforce.

AINov 3, 2023
APRICOT-Mamba: Acuity Prediction in Intensive Care Unit (ICU): Development and Validation of a Stability, Transitions, and Life-Sustaining Therapies Prediction Model

Miguel Contreras, Brandon Silva, Benjamin Shickel et al.

The acuity state of patients in the intensive care unit (ICU) can quickly change from stable to unstable. Early detection of deteriorating conditions can result in providing timely interventions and improved survival rates. In this study, we propose APRICOT-M (Acuity Prediction in Intensive Care Unit-Mamba), a 150k-parameter state space-based neural network to predict acuity state, transitions, and the need for life-sustaining therapies in real-time in ICU patients. The model uses data obtained in the prior four hours in the ICU and patient information obtained at admission to predict the acuity outcomes in the next four hours. We validated APRICOT-M externally on data from hospitals not used in development (75,668 patients from 147 hospitals), temporally on data from a period not used in development (12,927 patients from one hospital from 2018-2019), and prospectively on data collected in real-time (215 patients from one hospital from 2021-2023) using three large datasets: the University of Florida Health (UFH) dataset, the electronic ICU Collaborative Research Database (eICU), and the Medical Information Mart for Intensive Care (MIMIC)-IV. The area under the receiver operating characteristic curve (AUROC) of APRICOT-M for mortality (external 0.94-0.95, temporal 0.97-0.98, prospective 0.96-1.00) and acuity (external 0.95-0.95, temporal 0.97-0.97, prospective 0.96-0.96) shows comparable results to state-of-the-art models. Furthermore, APRICOT-M can predict transitions to instability (external 0.81-0.82, temporal 0.77-0.78, prospective 0.68-0.75) and need for life-sustaining therapies, including mechanical ventilation (external 0.82-0.83, temporal 0.87-0.88, prospective 0.67-0.76), and vasopressors (external 0.81-0.82, temporal 0.73-0.75, prospective 0.66-0.74). This tool allows for real-time acuity monitoring in critically ill patients and can help clinicians make timely interventions.

QMMar 9, 2023
Computable Phenotypes to Characterize Changing Patient Brain Dysfunction in the Intensive Care Unit

Yuanfang Ren, Tyler J. Loftus, Ziyuan Guan et al.

In the United States, more than 5 million patients are admitted annually to ICUs, with ICU mortality of 10%-29% and costs over $82 billion. Acute brain dysfunction status, delirium, is often underdiagnosed or undervalued. This study's objective was to develop automated computable phenotypes for acute brain dysfunction states and describe transitions among brain dysfunction states to illustrate the clinical trajectories of ICU patients. We created two single-center, longitudinal EHR datasets for 48,817 adult patients admitted to an ICU at UFH Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acute brain dysfunction status including coma, delirium, normal, or death at 12-hour intervals of each ICU admission and to identify acute brain dysfunction phenotypes using continuous acute brain dysfunction status and k-means clustering approach. There were 49,770 admissions for 37,835 patients in UFH GNV dataset and 18,472 admissions for 10,982 patients in UFH JAX dataset. In total, 18% of patients had coma as the worst brain dysfunction status; every 12 hours, around 4%-7% would transit to delirium, 22%-25% would recover, 3%-4% would expire, and 67%-68% would remain in a coma in the ICU. Additionally, 7% of patients had delirium as the worst brain dysfunction status; around 6%-7% would transit to coma, 40%-42% would be no delirium, 1% would expire, and 51%-52% would remain delirium in the ICU. There were three phenotypes: persistent coma/delirium, persistently normal, and transition from coma/delirium to normal almost exclusively in first 48 hours after ICU admission. We developed phenotyping scoring algorithms that determined acute brain dysfunction status every 12 hours while admitted to the ICU. This approach may be useful in developing prognostic and decision-support tools to aid patients and clinicians in decision-making on resource use and escalation of care.

LGNov 3, 2023
The Potential of Wearable Sensors for Assessing Patient Acuity in Intensive Care Unit (ICU)

Jessica Sena, Mohammad Tahsin Mostafiz, Jiaqing Zhang et al.

Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult to use for healthcare providers. Furthermore, such scores do not incorporate granular information such as patients' mobility level, which can indicate recovery or deterioration in the ICU. We hypothesized that existing acuity scores could be potentially improved by employing Artificial Intelligence (AI) techniques in conjunction with Electronic Health Records (EHR) and wearable sensor data. In this study, we evaluated the impact of integrating mobility data collected from wrist-worn accelerometers with clinical data obtained from EHR for developing an AI-driven acuity assessment score. Accelerometry data were collected from 86 patients wearing accelerometers on their wrists in an academic hospital setting. The data was analyzed using five deep neural network models: VGG, ResNet, MobileNet, SqueezeNet, and a custom Transformer network. These models outperformed a rule-based clinical score (SOFA= Sequential Organ Failure Assessment) used as a baseline, particularly regarding the precision, sensitivity, and F1 score. The results showed that while a model relying solely on accelerometer data achieved limited performance (AUC 0.50, Precision 0.61, and F1-score 0.68), including demographic information with the accelerometer data led to a notable enhancement in performance (AUC 0.69, Precision 0.75, and F1-score 0.67). This work shows that the combination of mobility and patient information can successfully differentiate between stable and unstable states in critically ill patients.

LGMar 17
Federated Learning with Multi-Partner OneFlorida+ Consortium Data for Predicting Major Postoperative Complications

Yuanfang Ren, Varun Sai Vemuri, Zhenhong Hu et al.

Background: This study aims to develop and validate federated learning models for predicting major postoperative complications and mortality using a large multicenter dataset from the OneFlorida Data Trust. We hypothesize that federated learning models will offer robust generalizability while preserving data privacy and security. Methods: This retrospective, longitudinal, multicenter cohort study included 358,644 adult patients admitted to five healthcare institutions, who underwent 494,163 inpatient major surgical procedures from 2012-2023. We developed and internally and externally validated federated learning models to predict the postoperative risk of intensive care unit (ICU) admission, mechanical ventilation (MV) therapy, acute kidney injury (AKI), and in-hospital mortality. These models were compared with local models trained on data from a single center and central models trained on a pooled dataset from all centers. Performance was primarily evaluated using area under the receiver operating characteristics curve (AUROC) and the area under the precision-recall curve (AUPRC) values. Results: Our federated learning models demonstrated strong predictive performance, with AUROC scores consistently comparable or superior performance in terms of AUROC and AUPRC across all outcomes and sites. Our federated learning models also demonstrated strong generalizability, with comparable or superior performance in terms of both AUROC and AUPRC compared to the best local learning model at each site. Conclusions: By leveraging multicenter data, we developed robust, generalizable, and privacy-preserving predictive models for major postoperative complications and mortality. These findings support the feasibility of federated learning in clinical decision support systems.

LGJul 27, 2023
Identifying acute illness phenotypes via deep temporal interpolation and clustering network on physiologic signatures

Yuanfang Ren, Yanjun Li, Tyler J. Loftus et al.

Initial hours of hospital admission impact clinical trajectory, but early clinical decisions often suffer due to data paucity. With clustering analysis for vital signs within six hours of admission, patient phenotypes with distinct pathophysiological signatures and outcomes may support early clinical decisions. We created a single-center, longitudinal EHR dataset for 75,762 adults admitted to a tertiary care center for 6+ hours. We proposed a deep temporal interpolation and clustering network to extract latent representations from sparse, irregularly sampled vital sign data and derived distinct patient phenotypes in a training cohort (n=41,502). Model and hyper-parameters were chosen based on a validation cohort (n=17,415). Test cohort (n=16,845) was used to analyze reproducibility and correlation with biomarkers. The training, validation, and testing cohorts had similar distributions of age (54-55 yrs), sex (55% female), race, comorbidities, and illness severity. Four clusters were identified. Phenotype A (18%) had most comorbid disease with higher rate of prolonged respiratory insufficiency, acute kidney injury, sepsis, and three-year mortality. Phenotypes B (33%) and C (31%) had diffuse patterns of mild organ dysfunction. Phenotype B had favorable short-term outcomes but second-highest three-year mortality. Phenotype C had favorable clinical outcomes. Phenotype D (17%) had early/persistent hypotension, high rate of early surgery, and substantial biomarker rate of inflammation but second-lowest three-year mortality. After comparing phenotypes' SOFA scores, clustering results did not simply repeat other acuity assessments. In a heterogeneous cohort, four phenotypes with distinct categories of disease and outcomes were identified by a deep temporal interpolation and clustering network. This tool may impact triage decisions and clinical decision-support under time constraints.

LGMay 19
CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support

Ricardo Diaz-Rincon, Muxuan Liang, Adolfo Ramirez-Zamora et al.

Effective medication management in Parkinson's Disease (PD) is challenging due to heterogeneous disease progression, variable patient response, and medication side effects. While AI models can forecast levodopa equivalent daily dose (LEDD) as a measure of medication needs, standard uncertainty quantification often fails to communicate the reliability of these predictions, treating high and low confidence clinical decisions identically. We introduce CASCADE (Calibrated Adaptive Scaling via Conformal And Distributional Estimation), a novel conformal prediction framework that propagates epistemic uncertainty from a screening classifier to adapt downstream predictions. Unlike standard conformal methods that rely on auxiliary residual regression, we leverage epistemic uncertainty from a primary classification task (identifying whether a medication change is needed) to dynamically scale the prediction intervals of a secondary regression task (predicting how much change). By mapping Venn-Abers multi-probabilistic uncertainty directly to non-conformity scores, our framework achieves continuous risk adaptation. We demonstrate that this ``cascade effect'' produces highly efficient intervals for confident patients (38.9% narrower than standard conformal baselines) while automatically expanding intervals to ensure robust coverage for uncertain cases, bridging the gap between discrete clinical decision-making and continuous dose forecasting in PD.

IVJan 23, 2024
CIS-UNet: Multi-Class Segmentation of the Aorta in Computed Tomography Angiography via Context-Aware Shifted Window Self-Attention

Muhammad Imran, Jonathan R Krebs, Veera Rajasekhar Reddy Gopu et al.

Advancements in medical imaging and endovascular grafting have facilitated minimally invasive treatments for aortic diseases. Accurate 3D segmentation of the aorta and its branches is crucial for interventions, as inaccurate segmentation can lead to erroneous surgical planning and endograft construction. Previous methods simplified aortic segmentation as a binary image segmentation problem, overlooking the necessity of distinguishing between individual aortic branches. In this paper, we introduce Context Infused Swin-UNet (CIS-UNet), a deep learning model designed for multi-class segmentation of the aorta and thirteen aortic branches. Combining the strengths of Convolutional Neural Networks (CNNs) and Swin transformers, CIS-UNet adopts a hierarchical encoder-decoder structure comprising a CNN encoder, symmetric decoder, skip connections, and a novel Context-aware Shifted Window Self-Attention (CSW-SA) as the bottleneck block. Notably, CSW-SA introduces a unique utilization of the patch merging layer, distinct from conventional Swin transformers. It efficiently condenses the feature map, providing a global spatial context and enhancing performance when applied at the bottleneck layer, offering superior computational efficiency and segmentation accuracy compared to the Swin transformers. We trained our model on computed tomography (CT) scans from 44 patients and tested it on 15 patients. CIS-UNet outperformed the state-of-the-art SwinUNetR segmentation model, which is solely based on Swin transformers, by achieving a superior mean Dice coefficient of 0.713 compared to 0.697, and a mean surface distance of 2.78 mm compared to 3.39 mm. CIS-UNet's superior 3D aortic segmentation offers improved precision and optimization for planning endovascular treatments. Our dataset and code will be publicly available.

LGApr 10, 2024
Global Contrastive Training for Multimodal Electronic Health Records with Language Supervision

Yingbo Ma, Suraj Kolla, Zhenhong Hu et al.

Modern electronic health records (EHRs) hold immense promise in tracking personalized patient health trajectories through sequential deep learning, owing to their extensive breadth, scale, and temporal granularity. Nonetheless, how to effectively leverage multiple modalities from EHRs poses significant challenges, given its complex characteristics such as high dimensionality, multimodality, sparsity, varied recording frequencies, and temporal irregularities. To this end, this paper introduces a novel multimodal contrastive learning framework, specifically focusing on medical time series and clinical notes. To tackle the challenge of sparsity and irregular time intervals in medical time series, the framework integrates temporal cross-attention transformers with a dynamic embedding and tokenization scheme for learning multimodal feature representations. To harness the interconnected relationships between medical time series and clinical notes, the framework equips a global contrastive loss, aligning a patient's multimodal feature representations with the corresponding discharge summaries. Since discharge summaries uniquely pertain to individual patients and represent a holistic view of the patient's hospital stay, machine learning models are led to learn discriminative multimodal features via global contrasting. Extensive experiments with a real-world EHR dataset demonstrated that our framework outperformed state-of-the-art approaches on the exemplar task of predicting the occurrence of nine postoperative complications for more than 120,000 major inpatient surgeries using multimodal data from UF health system split among three hospitals (UF Health Gainesville, UF Health Jacksonville, and UF Health Jacksonville-North).

QMMay 27, 2025
Learning optimal treatment strategies for intraoperative hypotension using deep reinforcement learning

Esra Adiyeke, Tianqi Liu, Venkata Sai Dheeraj Naganaboina et al.

Traditional methods of surgical decision making heavily rely on human experience and prompt actions, which are variable. A data-driven system generating treatment recommendations based on patient states can be a substantial asset in perioperative decision-making, as in cases of intraoperative hypotension, for which suboptimal management is associated with acute kidney injury (AKI), a common and morbid postoperative complication. We developed a Reinforcement Learning (RL) model to recommend optimum dose of intravenous (IV) fluid and vasopressors during surgery to avoid intraoperative hypotension and postoperative AKI. We retrospectively analyzed 50,021 surgeries from 42,547 adult patients who underwent major surgery at a quaternary care hospital between June 2014 and September 2020. Of these, 34,186 surgeries were used for model training and 15,835 surgeries were reserved for testing. We developed a Deep Q-Networks based RL model using 16 variables including intraoperative physiologic time series, total dose of IV fluid and vasopressors extracted for every 15-minute epoch. The model replicated 69% of physician's decisions for the dosage of vasopressors and proposed higher or lower dosage of vasopressors than received in 10% and 21% of the treatments, respectively. In terms of IV fluids, the model's recommendations were within 0.05 ml/kg/15 min of the actual dose in 41% of the cases, with higher or lower doses recommended for 27% and 32% of the treatments, respectively. The model resulted in a higher estimated policy value compared to the physicians' actual treatments, as well as random and zero-drug policies. AKI prevalence was the lowest in patients receiving medication dosages that aligned with model's decisions. Our findings suggest that implementation of the model's policy has the potential to reduce postoperative AKI and improve other outcomes driven by intraoperative hypotension.

HCApr 18, 2024
Transparent AI: Developing an Explainable Interface for Predicting Postoperative Complications

Yuanfang Ren, Chirayu Tripathi, Ziyuan Guan et al.

Given the sheer volume of surgical procedures and the significant rate of postoperative fatalities, assessing and managing surgical complications has become a critical public health concern. Existing artificial intelligence (AI) tools for risk surveillance and diagnosis often lack adequate interpretability, fairness, and reproducibility. To address this, we proposed an Explainable AI (XAI) framework designed to answer five critical questions: why, why not, how, what if, and what else, with the goal of enhancing the explainability and transparency of AI models. We incorporated various techniques such as Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), counterfactual explanations, model cards, an interactive feature manipulation interface, and the identification of similar patients to address these questions. We showcased an XAI interface prototype that adheres to this framework for predicting major postoperative complications. This initial implementation has provided valuable insights into the vast explanatory potential of our XAI framework and represents an initial step towards its clinical adoption.

LGApr 9, 2024
Federated learning model for predicting major postoperative complications

Yonggi Park, Yuanfang Ren, Benjamin Shickel et al.

Background: The accurate prediction of postoperative complication risk using Electronic Health Records (EHR) and artificial intelligence shows great potential. Training a robust artificial intelligence model typically requires large-scale and diverse datasets. In reality, collecting medical data often encounters challenges surrounding privacy protection. Methods: This retrospective cohort study includes adult patients who were admitted to UFH Gainesville (GNV) (n = 79,850) and Jacksonville (JAX) (n = 28,636) for any type of inpatient surgical procedure. Using perioperative and intraoperative features, we developed federated learning models to predict nine major postoperative complications (i.e., prolonged intensive care unit stay and mechanical ventilation). We compared federated learning models with local learning models trained on a single site and central learning models trained on pooled dataset from two centers. Results: Our federated learning models achieved the area under the receiver operating characteristics curve (AUROC) values ranged from 0.81 for wound complications to 0.92 for prolonged ICU stay at UFH GNV center. At UFH JAX center, these values ranged from 0.73-0.74 for wound complications to 0.92-0.93 for hospital mortality. Federated learning models achieved comparable AUROC performance to central learning models, except for prolonged ICU stay, where the performance of federated learning models was slightly higher than central learning models at UFH GNV center, but slightly lower at UFH JAX center. In addition, our federated learning model obtained comparable performance to the best local learning model at each center, demonstrating strong generalizability. Conclusion: Federated learning is shown to be a useful tool to train robust and generalizable models from large scale data across multiple institutions where data protection barriers are high.

LGMar 6, 2024
Temporal Cross-Attention for Dynamic Embedding and Tokenization of Multimodal Electronic Health Records

Yingbo Ma, Suraj Kolla, Dhruv Kaliraman et al.

The breadth, scale, and temporal granularity of modern electronic health records (EHR) systems offers great potential for estimating personalized and contextual patient health trajectories using sequential deep learning. However, learning useful representations of EHR data is challenging due to its high dimensionality, sparsity, multimodality, irregular and variable-specific recording frequency, and timestamp duplication when multiple measurements are recorded simultaneously. Although recent efforts to fuse structured EHR and unstructured clinical notes suggest the potential for more accurate prediction of clinical outcomes, less focus has been placed on EHR embedding approaches that directly address temporal EHR challenges by learning time-aware representations from multimodal patient time series. In this paper, we introduce a dynamic embedding and tokenization framework for precise representation of multimodal clinical time series that combines novel methods for encoding time and sequential position with temporal cross-attention. Our embedding and tokenization framework, when integrated into a multitask transformer classifier with sliding window attention, outperformed baseline approaches on the exemplar task of predicting the occurrence of nine postoperative complications of more than 120,000 major inpatient surgeries using multimodal data from three hospitals and two academic health centers in the United States.

LGFeb 6, 2024
Acute kidney injury prediction for non-critical care patients: a retrospective external and internal validation study

Esra Adiyeke, Yuanfang Ren, Benjamin Shickel et al.

Background: Acute kidney injury (AKI), the decline of kidney excretory function, occurs in up to 18% of hospitalized admissions. Progression of AKI may lead to irreversible kidney damage. Methods: This retrospective cohort study includes adult patients admitted to a non-intensive care unit at the University of Pittsburgh Medical Center (UPMC) (n = 46,815) and University of Florida Health (UFH) (n = 127,202). We developed and compared deep learning and conventional machine learning models to predict progression to Stage 2 or higher AKI within the next 48 hours. We trained local models for each site (UFH Model trained on UFH, UPMC Model trained on UPMC) and a separate model with a development cohort of patients from both sites (UFH-UPMC Model). We internally and externally validated the models on each site and performed subgroup analyses across sex and race. Results: Stage 2 or higher AKI occurred in 3% (n=3,257) and 8% (n=2,296) of UFH and UPMC patients, respectively. Area under the receiver operating curve values (AUROC) for the UFH test cohort ranged between 0.77 (UPMC Model) and 0.81 (UFH Model), while AUROC values ranged between 0.79 (UFH Model) and 0.83 (UPMC Model) for the UPMC test cohort. UFH-UPMC Model achieved an AUROC of 0.81 (95% confidence interval [CI] [0.80, 0.83]) for UFH and 0.82 (95% CI [0.81,0.84]) for UPMC test cohorts; an area under the precision recall curve values (AUPRC) of 0.6 (95% CI, [0.05, 0.06]) for UFH and 0.13 (95% CI, [0.11,0.15]) for UPMC test cohorts. Kinetic estimated glomerular filtration rate, nephrotoxic drug burden and blood urea nitrogen remained the top three features with the highest influence across the models and health centers. Conclusion: Locally developed models displayed marginally reduced discrimination when tested on another institution, while the top set of influencing features remained the same across the models and sites.

LGAug 14, 2025
Uncertainty-Aware Prediction of Parkinson's Disease Medication Needs: A Two-Stage Conformal Prediction Approach

Ricardo Diaz-Rincon, Muxuan Liang, Adolfo Ramirez-Zamora et al.

Parkinson's Disease (PD) medication management presents unique challenges due to heterogeneous disease progression and treatment response. Neurologists must balance symptom control with optimal dopaminergic dosing based on functional disability while minimizing side effects. This balance is crucial as inadequate or abrupt changes can cause levodopa-induced dyskinesia, wearing off, and neuropsychiatric effects, significantly reducing quality of life. Current approaches rely on trial-and-error decisions without systematic predictive methods. Despite machine learning advances, clinical adoption remains limited due to reliance on point predictions that do not account for prediction uncertainty, undermining clinical trust and utility. Clinicians require not only predictions of future medication needs but also reliable confidence measures. Without quantified uncertainty, adjustments risk premature escalation to maximum doses or prolonged inadequate symptom control. We developed a conformal prediction framework anticipating medication needs up to two years in advance with reliable prediction intervals and statistical guarantees. Our approach addresses zero-inflation in PD inpatient data, where patients maintain stable medication regimens between visits. Using electronic health records from 631 inpatient admissions at University of Florida Health (2011-2021), our two-stage approach identifies patients likely to need medication changes, then predicts required levodopa equivalent daily dose adjustments. Our framework achieved marginal coverage while reducing prediction interval lengths compared to traditional approaches, providing precise predictions for short-term planning and wider ranges for long-term forecasting. By quantifying uncertainty, our approach enables evidence-based decisions about levodopa dosing, optimizing symptom control while minimizing side effects and improving life quality.

CLJan 24, 2024
Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Darren Liu, Cheng Ding, Delgersuren Bold et al.

The field of healthcare has increasingly turned its focus towards Large Language Models (LLMs) due to their remarkable performance. However, their performance in actual clinical applications has been underexplored. Traditional evaluations based on question-answering tasks don't fully capture the nuanced contexts. This gap highlights the need for more in-depth and practical assessments of LLMs in real-world healthcare settings. Objective: We sought to evaluate the performance of LLMs in the complex clinical context of adult critical care medicine using systematic and comprehensible analytic methods, including clinician annotation and adjudication. Methods: We investigated the performance of three general LLMs in understanding and processing real-world clinical notes. Concepts from 150 clinical notes were identified by MetaMap and then labeled by 9 clinicians. Each LLM's proficiency was evaluated by identifying the temporality and negation of these concepts using different prompts for an in-depth analysis. Results: GPT-4 showed overall superior performance compared to other LLMs. In contrast, both GPT-3.5 and text-davinci-003 exhibit enhanced performance when the appropriate prompting strategies are employed. The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities. Conclusion: A comprehensive qualitative performance evaluation framework for LLMs is developed and operationalized. This framework goes beyond singular performance aspects. With expert annotations, this methodology not only validates LLMs' capabilities in processing complex medical data but also establishes a benchmark for future LLM evaluations across specialized domains.

LGNov 9, 2021
Multi-Task Prediction of Clinical Outcomes in the Intensive Care Unit using Flexible Multimodal Transformers

Benjamin Shickel, Patrick J. Tighe, Azra Bihorac et al.

Recent deep learning research based on Transformer model architectures has demonstrated state-of-the-art performance across a variety of domains and tasks, mostly within the computer vision and natural language processing domains. While some recent studies have implemented Transformers for clinical tasks using electronic health records data, they are limited in scope, flexibility, and comprehensiveness. In this study, we propose a flexible Transformer-based EHR embedding pipeline and predictive model framework that introduces several novel modifications of existing workflows that capitalize on data attributes unique to the healthcare domain. We showcase the feasibility of our flexible design in a case study in the intensive care unit, where our models accurately predict seven clinical outcomes pertaining to readmission and patient mortality over multiple future time horizons.

QMApr 27, 2020
Computable Phenotypes of Patient Acuity in the Intensive Care Unit

Yuanfang Ren, Jeremy Balch, Kenneth L. Abbott et al.

Continuous monitoring and patient acuity assessments are key aspects of Intensive Care Unit (ICU) practice, but both are limited by time constraints imposed on healthcare providers. Moreover, anticipating clinical trajectories remains imprecise. The objectives of this study are to (1) develop an electronic phenotype of acuity using automated variable retrieval within the electronic health records and (2) describe transitions between acuity states that illustrate the clinical trajectories of ICU patients. We gathered two single-center, longitudinal electronic health record datasets for 51,372 adult ICU patients admitted to the University of Florida Health (UFH) Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acuity status at four-hour intervals for each ICU admission and identify acuity phenotypes using continuous acuity status and k-means clustering approach. 51,073 admissions for 38,749 patients in the UFH GNV dataset and 22,219 admissions for 12,623 patients in the UFH JAX dataset had at least one ICU stay lasting more than four hours. There were three phenotypes: persistently stable, persistently unstable, and transitioning from unstable to stable. For stable patients, approximately 0.7%-1.7% would transition to unstable, 0.02%-0.1% would expire, 1.2%-3.4% would be discharged, and the remaining 96%-97% would remain stable in the ICU every four hours. For unstable patients, approximately 6%-10% would transition to stable, 0.4%-0.5% would expire, and the remaining 89%-93% would remain unstable in the ICU in the next four hours. We developed phenotyping algorithms for patient acuity status every four hours while admitted to the ICU. This approach may be useful in developing prognostic and clinical decision-support tools to aid patients, caregivers, and providers in shared decision-making processes regarding escalation of care and patient values.

LGApr 27, 2020
Dynamic Predictions of Postoperative Complications from Explainable, Uncertainty-Aware, and Multi-Task Deep Neural Networks

Benjamin Shickel, Tyler J. Loftus, Matthew Ruppert et al.

Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform random forest models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.

LGApr 27, 2020
Sequential Interpretability: Methods, Applications, and Future Direction for Understanding Deep Learning Models in the Context of Sequential Data

Benjamin Shickel, Parisa Rashidi

Deep learning continues to revolutionize an ever-growing number of critical application areas including healthcare, transportation, finance, and basic sciences. Despite their increased predictive power, model transparency and human explainability remain a significant challenge due to the "black box" nature of modern deep learning models. In many cases the desired balance between interpretability and performance is predominately task specific. Human-centric domains such as healthcare necessitate a renewed focus on understanding how and why these frameworks are arriving at critical and potentially life-or-death decisions. Given the quantity of research and empirical successes of deep learning for computer vision, most of the existing interpretability research has focused on image processing techniques. Comparatively, less attention has been paid to interpreting deep learning frameworks using sequential data. Given recent deep learning advancements in highly sequential domains such as natural language processing and physiological signal processing, the need for deep sequential explanations is at an all-time high. In this paper, we review current techniques for interpreting deep learning techniques involving sequential data, identify similarities to non-sequential methods, and discuss current limitations and future avenues of sequential interpretability research.

HCSep 16, 2019
Automatic Detection and Classification of Cognitive Distortions in Mental Health Text

Benjamin Shickel, Scott Siegel, Martin Heesacker et al.

In cognitive psychology, automatic and self-reinforcing irrational thought patterns are known as cognitive distortions. Left unchecked, patients exhibiting these types of thoughts can become stuck in negative feedback loops of unhealthy thinking, leading to inaccurate perceptions of reality commonly associated with anxiety and depression. In this paper, we present a machine learning framework for the automatic detection and classification of 15 common cognitive distortions in two novel mental health free text datasets collected from both crowdsourcing and a real-world online therapy program. When differentiating between distorted and non-distorted passages, our model achieved a weighted F1 score of 0.88. For classifying distorted passages into one of 15 distortion categories, our model yielded weighted F1 scores of 0.68 in the larger crowdsourced dataset and 0.45 in the smaller online counseling dataset, both of which outperformed random baseline metrics by a large margin. For both tasks, we also identified the most discriminative words and phrases between classes to highlight common thematic elements for improving targeted and therapist-guided mental health treatment. Furthermore, we performed an exploratory analysis using unsupervised content-based clustering and topic modeling algorithms as first efforts towards a data-driven perspective on the thematic relationship between similar cognitive distortions traditionally deemed unique. Finally, we highlight the difficulties in applying mental health-based machine learning in a real-world setting and comment on the implications and benefits of our framework for improving automated delivery of therapeutic treatment in conjunction with traditional cognitive-behavioral therapy.

HCApr 25, 2018
The Intelligent ICU Pilot Study: Using Artificial Intelligence Technology for Autonomous Patient Monitoring

Anis Davoudi, Kumar Rohit Malhotra, Benjamin Shickel et al.

Currently, many critical care indices are repetitively assessed and recorded by overburdened nurses, e.g. physical function or facial pain expressions of nonverbal patients. In addition, many essential information on patients and their environment are not captured at all, or are captured in a non-granular manner, e.g. sleep disturbance factors such as bright light, loud background noise, or excessive visitations. In this pilot study, we examined the feasibility of using pervasive sensing technology and artificial intelligence for autonomous and granular monitoring of critically ill patients and their environment in the Intensive Care Unit (ICU). As an exemplar prevalent condition, we also characterized delirious and non-delirious patients and their environment. We used wearable sensors, light and sound sensors, and a high-resolution camera to collected data on patients and their environment. We analyzed collected data using deep learning and statistical analysis. Our system performed face detection, face recognition, facial action unit detection, head pose detection, facial expression recognition, posture recognition, actigraphy analysis, sound pressure and light level detection, and visitation frequency detection. We were able to detect patient's face (Mean average precision (mAP)=0.94), recognize patient's face (mAP=0.80), and their postures (F1=0.94). We also found that all facial expressions, 11 activity features, visitation frequency during the day, visitation frequency during the night, light levels, and sound pressure levels during the night were significantly different between delirious and non-delirious patients (p-value<0.05). In summary, we showed that granular and autonomous monitoring of critically ill patients and their environment is feasible and can be used for characterizing critical care conditions and related environment factors.

LGFeb 28, 2018
DeepSOFA: A Continuous Acuity Score for Critically Ill Patients using Clinically Interpretable Deep Learning

Benjamin Shickel, Tyler J. Loftus, Lasith Adhikari et al.

Traditional methods for assessing illness severity and predicting in-hospital mortality among critically ill patients require time-consuming, error-prone calculations using static variable thresholds. These methods do not capitalize on the emerging availability of streaming electronic health record data or capture time-sensitive individual physiological patterns, a critical task in the intensive care unit. We propose a novel acuity score framework (DeepSOFA) that leverages temporal measurements and interpretable deep learning models to assess illness severity at any point during an ICU stay. We compare DeepSOFA with SOFA (Sequential Organ Failure Assessment) baseline models using the same model inputs and find that at any point during an ICU admission, DeepSOFA yields significantly more accurate predictions of in-hospital mortality. A DeepSOFA model developed in a public database and validated in a single institutional cohort had a mean AUC for the entire ICU stay of 0.90 (95% CI 0.90-0.91) compared with baseline SOFA models with mean AUC 0.79 (95% CI 0.79-0.80) and 0.85 (95% CI 0.85-0.86). Deep models are well-suited to identify ICU patients in need of life-saving interventions prior to the occurrence of an unexpected adverse event and inform shared decision-making processes among patients, providers, and families regarding goals of care and optimal resource utilization.

CLAug 4, 2017
Hashtag Healthcare: From Tweets to Mental Health Journals Using Deep Transfer Learning

Benjamin Shickel, Martin Heesacker, Sherry Benton et al.

As the popularity of social media platforms continues to rise, an ever-increasing amount of human communication and self- expression takes place online. Most recent research has focused on mining social media for public user opinion about external entities such as product reviews or sentiment towards political news. However, less attention has been paid to analyzing users' internalized thoughts and emotions from a mental health perspective. In this paper, we quantify the semantic difference between public Tweets and private mental health journals used in online cognitive behavioral therapy. We will use deep transfer learning techniques for analyzing the semantic gap between the two domains. We show that for the task of emotional valence prediction, social media can be successfully harnessed to create more accurate, robust, and personalized mental health models. Our results suggest that the semantic gap between public and private self-expression is small, and that utilizing the abundance of available social media is one way to overcome the small sample sizes of mental health data, which are commonly limited by availability and privacy concerns.

LGJun 12, 2017
Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis

Benjamin Shickel, Patrick Tighe, Azra Bihorac et al.

The past decade has seen an explosion in the amount of digital information stored in electronic health records (EHR). While primarily designed for archiving patient clinical information and administrative healthcare tasks, many researchers have found secondary use of these records for various clinical informatics tasks. Over the same period, the machine learning community has seen widespread advances in deep learning techniques, which also have been successfully applied to the vast amount of EHR data. In this paper, we review these deep EHR systems, examining architectures, technical aspects, and clinical applications. We also identify shortcomings of current techniques and discuss avenues of future research for EHR-based deep learning.