AI · Mar 18, 2023 · Code
A general-purpose AI assistant embedded in an open-source radiology information system
Saptarshi Purkayastha, Rohan Isaac, Sharon Anthony et al.
Radiology AI models have made significant progress toward near-human performance, and in some cases surpass it. However, the partnership between AI models and human radiologists remains an unexplored challenge due to the lack of health information standards, contextual and workflow differences, and data labeling variations. To overcome these challenges, we integrated an AI model service that uses DICOM standard SR annotations into the OHIF viewer in the open-source LibreHealth Radiology Information System (RIS). In this paper, we describe the novel Human-AI partnership capabilities of the platform, including few-shot learning and swarm learning approaches to retrain the AI models continuously. Building on the concept of machine teaching, we developed an active learning strategy within the RIS, so that the human radiologist can enable/disable AI annotations as well as "fix"/relabel the AI annotations. These annotations are then used to retrain the models. This helps establish a partnership between the radiologist user and a user-specific AI model. Finally, the weights of these user-specific models are shared between multiple models in a swarm learning approach.
CV · Apr 6, 2022 · Code
OSCARS: An Outlier-Sensitive Content-Based Radiography Retrieval System
Xiaoyuan Guo, Jiali Duan, Saptarshi Purkayastha et al.
Improving retrieval relevance on noisy datasets is an emerging need for the curation of large-scale clean datasets in the medical domain. While existing methods can be applied for class-wise (i.e., inter-class) retrieval, they cannot distinguish the granularity of likeness within the same class (i.e., intra-class). The problem is exacerbated on external medical datasets, where noisy samples of the same class are treated equally during training. Our goal is to identify both intra- and inter-class similarities for fine-grained retrieval. To achieve this, we propose an Outlier-Sensitive Content-based rAdiography Retrieval System (OSCARS), consisting of two steps. First, we train an outlier detector on a clean internal dataset in an unsupervised manner. We then use the trained detector to generate anomaly scores on the external dataset, whose distribution is used to bin intra-class variations. Second, we propose a quadruplet (a, p, n_intra, n_inter) sampling strategy, where intra-class negatives n_intra are sampled from bins of the same class other than the bin the anchor a belongs to, while inter-class negatives n_inter are randomly sampled from other classes. We suggest a weighted metric learning objective to balance intra- and inter-class feature learning. We experimented on two representative public radiography datasets. Experiments show the effectiveness of our approach. The training and evaluation code can be found at https://github.com/XiaoyuanGuo/oscars.
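The quadruplet sampling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see their repository for that): the function names, the equal-size binning of anomaly scores, and the choice to draw the positive from the anchor's own bin are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def bin_by_score(samples, n_bins=3):
    """Split each class into bins of roughly equal size, ordered by anomaly score.

    `samples` is a list of (sample_id, class_label, anomaly_score) tuples.
    Returns {class_label: [ids_in_bin_0, ids_in_bin_1, ...]}.
    Equal-size binning is an assumption; the paper bins by score distribution.
    """
    by_class = defaultdict(list)
    for sample_id, label, score in samples:
        by_class[label].append((score, sample_id))
    bins = {}
    for label, scored in by_class.items():
        scored.sort()
        size = max(1, -(-len(scored) // n_bins))  # ceiling division
        bins[label] = [[sid for _, sid in scored[i:i + size]]
                       for i in range(0, len(scored), size)]
    return bins


def sample_quadruplet(bins, rng=random):
    """Draw one (a, p, n_intra, n_inter) quadruplet of sample ids."""
    if len(bins) < 2:
        raise ValueError("need at least two classes for inter-class negatives")
    # The anchor's bin must hold a second sample (the positive), and its
    # class must have another bin to supply the intra-class negative.
    eligible = [(label, idx)
                for label, class_bins in bins.items() if len(class_bins) > 1
                for idx, members in enumerate(class_bins) if len(members) >= 2]
    if not eligible:
        raise ValueError("no bin can supply an anchor/positive pair")
    label, idx = rng.choice(eligible)
    anchor, positive = rng.sample(bins[label][idx], 2)
    # Intra-class negative: same class, any bin other than the anchor's.
    other_bins = [b for j, b in enumerate(bins[label]) if j != idx]
    n_intra = rng.choice(rng.choice(other_bins))
    # Inter-class negative: any sample from a different class.
    other_label = rng.choice([l for l in bins if l != label])
    n_inter = rng.choice(rng.choice(bins[other_label]))
    return anchor, positive, n_intra, n_inter
```

In a training loop, the four ids would be mapped to images and fed to the weighted metric learning objective, which this sketch does not cover.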
CV · Nov 15, 2023
Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research
Bardia Khosravi, Frank Li, Theo Dapamede et al.
Chest X-rays (CXR) are essential for diagnosing a variety of conditions, but when used on new populations, model generalizability issues limit their efficacy. Generative AI, particularly denoising diffusion probabilistic models (DDPMs), offers a promising approach to generating synthetic images, enhancing dataset diversity. This study investigates the impact of synthetic data supplementation on the performance and generalizability of medical imaging research. The study employed DDPMs to create synthetic CXRs conditioned on demographic and pathological characteristics from the CheXpert dataset. These synthetic images were used to supplement training datasets for pathology classifiers, with the aim of improving their performance. The evaluation involved three datasets (CheXpert, MIMIC-CXR, and Emory Chest X-ray) and various experiments, including supplementing real data with synthetic data, training with purely synthetic data, and mixing synthetic data with external datasets. Performance was assessed using the area under the receiver operating curve (AUROC). Adding synthetic data to real datasets resulted in a notable increase in AUROC values (up to 0.02 in internal and external test sets with 1000% supplementation, p-value less than 0.01 in all instances). When classifiers were trained exclusively on synthetic data, they achieved performance levels comparable to those trained on real data with 200%-300% data supplementation. The combination of real and synthetic data from different sources demonstrated enhanced model generalizability, increasing model AUROC from 0.76 to 0.80 on the internal test set (p-value less than 0.01). In conclusion, synthetic data supplementation significantly improves the performance and generalizability of pathology classifiers in medical imaging.
IV · Apr 16, 2022
Few-Shot Transfer Learning to improve Chest X-Ray pathology detection using limited triplets
Ananth Reddy Bhimireddy, John Lee Burns, Saptarshi Purkayastha et al.
Deep learning approaches applied to medical imaging have reached near-human or better-than-human performance on many diagnostic tasks. For instance, the CheXpert competition on detecting pathologies in chest x-rays has shown excellent multi-class classification performance. However, training and validating deep learning models requires extensive collections of images, and the models still produce false inferences, as identified by a human-in-the-loop. In this paper, we introduce a practical approach to improve the predictions of a pre-trained model through Few-Shot Learning (FSL). After training and validating a model, a small number of false inference images are collected to retrain the model using Image Triplets: a false positive or false negative, a true positive, and a true negative. The retrained FSL model produces considerable gains in performance with only a few epochs and few images. In addition, FSL opens rapid retraining opportunities for human-in-the-loop systems, where a radiologist can relabel false inferences, and the model can be quickly retrained. We compare our retrained model performance with existing FSL approaches in medical imaging that train and evaluate models at once.
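The triplet construction described here (one false inference paired with a true positive and a true negative) can be sketched as follows. This is a simplified illustration for a single binary pathology label. The function name, the 0.5 threshold, and the random pairing are assumptions, not the authors' method.

```python
import random


def build_triplets(records, threshold=0.5, rng=random):
    """Pair every false inference with one true positive and one true negative.

    `records` is a list of (image_id, true_label, predicted_probability)
    tuples with true_label in {0, 1}. Returns a list of
    (false_inference_id, true_positive_id, true_negative_id) triplets.
    """
    true_pos, true_neg, false_inf = [], [], []
    for image_id, label, prob in records:
        predicted = int(prob >= threshold)
        if predicted != label:
            false_inf.append(image_id)   # false positive or false negative
        elif label == 1:
            true_pos.append(image_id)
        else:
            true_neg.append(image_id)
    if not true_pos or not true_neg:
        return []                        # a triplet cannot be completed
    return [(f, rng.choice(true_pos), rng.choice(true_neg)) for f in false_inf]
```

The resulting triplets would then drive a short retraining run of the pre-trained model, a step that depends on the model and loss and is left out here.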
CV · Nov 12, 2025
Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation
Frank Li, Theo Dapamede, Mohammadreza Chavoshi et al.
Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.
IV · Oct 29, 2021 · Code
CVAD: A generic medical anomaly detector based on Cascade VAE
Xiaoyuan Guo, Judy Wawira Gichoya, Saptarshi Purkayastha et al.
Detecting out-of-distribution (OOD) samples in medical imaging plays an important role in downstream medical diagnosis. However, existing OOD detectors are demonstrated on natural images composed of inter-classes and have difficulty generalizing to medical images. The key issue is the granularity of OOD data in the medical domain, where intra-class OOD samples are predominant. We focus on the generalizability of OOD detection for medical images and propose a self-supervised Cascade Variational autoencoder-based Anomaly Detector (CVAD). We use a cascade architecture of variational autoencoders, which combines latent representations at multiple scales before they are fed to a discriminator that distinguishes OOD data from in-distribution (ID) data. Finally, both the reconstruction error and the OOD probability predicted by the binary discriminator are used to determine the anomalies. We compare its performance with state-of-the-art deep learning models to demonstrate our model's efficacy on various open-access medical imaging datasets for both intra- and inter-class OOD. Further extensive results on additional datasets, including common natural image datasets, show our model's effectiveness and generalizability. The code is available at https://github.com/XiaoyuanGuo/CVAD.
HC · Oct 24, 2019 · Code
Development and Implementation of a Dashboard for Diabetes Care Management in OpenMRS
Bhanu Teja Yandrapalli, Josette Jones, Saptarshi Purkayastha
A clinical dashboard for a patient's diabetes condition helps physicians make better decisions based on readily available information. OpenMRS is a widely used open-source electronic health records system but does not provide a disease-specific dashboard. This project implemented a dashboard that displays all diabetes-related lab measures in one place when a physician accesses a patient record in OpenMRS. It summarizes a list of diabetes-related clinical measures through an intuitive, chart-based, customizable user experience. Gauge charts are used to display the most important lab values for Glucose, Renal Function, and Lipid Profile tests. Data tables display lab values from past and current visits and can be searched by visit date. Interactive line charts are used to display the trends of lab measures. The Diabetes Dashboard may help physicians make quicker decisions through this snapshot view. We took data for a few patients and demonstrated the dashboard to clinicians as a proof of concept, without performing a full-fledged user evaluation. Future work involves integrating this dashboard with clinical practice guidelines and alerting when measures are outside the guidelines.
AI · Feb 23
Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark
Lalitha Pranathi Pulavarthy, Raajitha Muthyala, Aravind V Kuruvikkattil et al.
Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment, as no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.
CV · Apr 22, 2025
Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis
Frank Li, Hari Trivedi, Bardia Khosravi et al.
Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.
CV · Jan 20, 2024
DengueNet: Dengue Prediction using Spatiotemporal Satellite Imagery for Resource-Limited Countries
Kuan-Ting Kuo, Dana Moukheiber, Sebastian Cajas Ordonez et al.
Dengue fever presents a substantial challenge in developing countries where sanitation infrastructure is inadequate. The absence of comprehensive healthcare systems exacerbates the severity of dengue infections, potentially leading to life-threatening circumstances. Rapid response to dengue outbreaks is also challenging due to limited information exchange and integration. While timely dengue outbreak forecasts have the potential to prevent such outbreaks, the majority of dengue prediction studies have predominantly relied on data that impose significant burdens on individual countries for collection. In this study, our aim is to improve health equity in resource-constrained countries by exploring the effectiveness of high-resolution satellite imagery as a nontraditional and readily accessible data source. By leveraging the wealth of publicly available and easily obtainable satellite imagery, we present a scalable satellite extraction framework based on Sentinel Hub, a cloud-based computing platform. Furthermore, we introduce DengueNet, an innovative architecture that combines Vision Transformer, Radiomics, and Long Short-term Memory to extract and integrate spatiotemporal features from satellite images. This enables dengue predictions on an epi-week basis. To evaluate the effectiveness of our proposed method, we conducted experiments on five municipalities in Colombia. We utilized a dataset comprising 780 high-resolution Sentinel-2 satellite images for training and evaluation. The performance of DengueNet was assessed using the mean absolute error (MAE) metric. Across the five municipalities, DengueNet achieved an average MAE of 43.92. Our findings strongly support the efficacy of satellite imagery as a valuable resource for dengue prediction, particularly in informing public health policies within countries where manually collected data is scarce and dengue virus prevalence is severe.
IV · Dec 27, 2021
MedShift: identifying shift data for medical dataset curation
Xiaoyuan Guo, Judy Wawira Gichoya, Hari Trivedi et al.
To curate a high-quality dataset, identifying data variance between the internal and external sources is a fundamental and crucial step. However, methods to detect shift or variance in data have not been significantly researched. Challenges to this are the lack of effective approaches to learn dense representation of a dataset and difficulties of sharing private data across medical institutions. To overcome the problems, we propose a unified pipeline called MedShift to detect the top-level shift samples and thus facilitate the medical curation. Given an internal dataset A as the base source, we first train anomaly detectors for each class of dataset A to learn internal distributions in an unsupervised way. Second, without exchanging data across sources, we run the trained anomaly detectors on an external dataset B for each class. The data samples with high anomaly scores are identified as shift data. To quantify the shiftness of the external dataset, we cluster B's data into groups class-wise based on the obtained scores. We then train a multi-class classifier on A and measure the shiftness with the classifier's performance variance on B by gradually dropping the group with the largest anomaly score for each class. Additionally, we adapt a dataset quality metric to help inspect the distribution differences for multiple medical sources. We verify the efficacy of MedShift with musculoskeletal radiographs (MURA) and chest X-rays datasets from more than one external source. Experiments show our proposed shift data detection pipeline can be beneficial for medical centers to curate high-quality datasets more efficiently. An interface introduction video to visualize our results is available at https://youtu.be/V3BF0P1sxQE.
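The shiftness measurement described in this abstract (group external samples by anomaly score, then drop the highest-scoring group of each class step by step while tracking classifier performance) can be sketched roughly as below. Everything here is illustrative: the function names are hypothetical, equal-size score groups stand in for the paper's clustering, and plain accuracy stands in for whichever classifier metric the authors use.

```python
from collections import defaultdict


def group_by_score(samples, n_groups):
    """Split each class of the external dataset into score-ordered groups.

    `samples` is a list of (class_label, anomaly_score, correct) tuples,
    where `correct` says whether the classifier trained on the internal
    dataset predicted this external sample correctly.
    """
    by_class = defaultdict(list)
    for label, score, correct in samples:
        by_class[label].append((score, correct))
    grouped = {}
    for label, items in by_class.items():
        items.sort(key=lambda item: item[0])          # low score first
        size = max(1, -(-len(items) // n_groups))     # ceiling division
        grouped[label] = [items[i:i + size] for i in range(0, len(items), size)]
    return grouped


def accuracy_curve(samples, n_groups=3):
    """Accuracy on the external data after dropping 0, 1, 2, ... top-score groups."""
    grouped = group_by_score(samples, n_groups)
    max_groups = max(len(groups) for groups in grouped.values())
    curve = []
    for dropped in range(max_groups):
        kept = [correct
                for groups in grouped.values()
                # always keep at least the lowest-score group of each class
                for group in groups[:max(1, len(groups) - dropped)]
                for _, correct in group]
        curve.append(sum(kept) / len(kept))
    return curve
```

A curve that rises sharply as high-score groups are removed suggests that those groups carry most of the shift. A flat curve suggests the external data is close to the internal distribution.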
CV · Jul 31, 2021
Margin-Aware Intra-Class Novelty Identification for Medical Images
Xiaoyuan Guo, Judy Wawira Gichoya, Saptarshi Purkayastha et al.
Traditional anomaly detection methods focus on detecting inter-class variations, while medical image novelty identification is inherently an intra-class detection problem. For example, a machine learning model trained with normal chest X-rays and common lung abnormalities is expected to discover and flag idiopathic pulmonary fibrosis, which is a rare lung disease unseen by the model during training. The nuances of intra-class variations and the lack of relevant training data in medical image analysis pose great challenges for existing anomaly detection methods. To tackle these challenges, we propose a hybrid model, Transformation-based Embedding learning for Novelty Detection (TEND), which, without any out-of-distribution training data, performs novelty identification by combining autoencoder-based and classifier-based methods. With a pre-trained autoencoder as the image feature extractor, TEND learns to discriminate the feature embeddings of in-distribution data from those of their transformed counterparts, which serve as fake out-of-distribution inputs. To enhance the separation, a distance objective is optimized to enforce a margin between the two classes. Extensive experimental results on both natural image datasets and medical image datasets are presented, and our method outperforms state-of-the-art approaches.
CV · Jul 21, 2021
Reading Race: AI Recognises Patient's Racial Identity In Medical Images
Imon Banerjee, Ananth Reddy Bhimireddy, John L. Burns et al.
Background: In medical imaging, prior studies have demonstrated disparate AI performance by race, yet there is no known correlation for race on medical imaging that would be obvious to the human expert interpreting the images. Methods: Using private and public datasets we evaluate: A) performance quantification of deep learning models to detect race from medical images, including the ability of these models to generalize to external environments and across multiple imaging modalities, B) assessment of possible confounding anatomic and phenotype population features, such as disease distribution and body habitus as predictors of race, and C) investigation into the underlying mechanism by which AI models can recognize race. Findings: Standard deep learning models can be trained to predict race from medical images with high performance across multiple imaging modalities. Our findings hold under external validation conditions, as well as when models are optimized to perform clinically motivated tasks. We demonstrate this detection is not due to trivial proxies or imaging-related surrogate covariates for race, such as underlying disease distribution. Finally, we show that performance persists over all anatomical regions and frequency spectrum of the images suggesting that mitigation efforts will be challenging and demand further study. Interpretation: We emphasize that model ability to predict self-reported race is itself not the issue of importance. However, our findings that AI can trivially predict self-reported race -- even from corrupted, cropped, and noised medical images -- in a setting where clinical experts cannot, creates an enormous risk for all model deployments in medical imaging: if an AI model secretly used its knowledge of self-reported race to misclassify all Black patients, radiologists would not be able to tell using the same data the model has access to.
SP · Feb 28, 2021
Human Activity Recognition using Deep Learning Models on Smartphones and Smartwatches Sensor Data
Bolu Oluwalade, Sunil Neela, Judy Wawira et al.
In recent years, human activity recognition has garnered considerable attention both in industrial and academic research because of the wide deployment of sensors, such as accelerometers and gyroscopes, in products such as smartphones and smartwatches. Activity recognition is currently applied in various fields where valuable information about an individual's functional ability and lifestyle is needed. In this study, we used the popular WISDM dataset for activity recognition. Using multivariate analysis of covariance (MANCOVA), we established a statistically significant difference (p<0.05) between the data generated from the sensors embedded in smartphones and smartwatches. By doing this, we show that smartphones and smartwatches don't capture data in the same way due to the location where they are worn. We deployed several neural network architectures to classify 15 different hand and non-hand-oriented activities. These models include Long short-term memory (LSTM), Bi-directional Long short-term memory (BiLSTM), Convolutional Neural Network (CNN), and Convolutional LSTM (ConvLSTM). The developed models performed best with watch accelerometer data. Also, we saw that the classification precision obtained with the convolutional input classifiers (CNN and ConvLSTM) was higher than the end-to-end LSTM classifier in 12 of the 15 activities. Additionally, the CNN model for the watch accelerometer was better able to classify non-hand oriented activities when compared to hand-oriented activities.
CR · Feb 23, 2021
Usability and Security of Different Authentication Methods for an Electronic Health Records System
Saptarshi Purkayastha, Shreya Goyal, Bolu Oluwalade et al.
We conducted a survey of 67 graduate students enrolled in the Privacy and Security in Healthcare course at Indiana University Purdue University Indianapolis. This was done to measure users' preferences and their understanding of the usability and security of three different Electronic Health Records authentication methods: single authentication method (username and password), Single sign-on with Central Authentication Service (CAS) authentication method, and a bio-capsule facial authentication method. This research aims to explore the relationship between security and usability, and to measure the effect of perceived security on usability in these three authentication methods. We developed a formative-formative Partial Least Square Structural Equation Modeling (PLS-SEM) model to measure the relationship between the latent variables of Usability and Security. The measurement model was developed using five observed variables (measures): Efficiency and Effectiveness, Satisfaction, Preference, Concerns, and Confidence. The results highlight the importance and impact of these measures on the latent variables and the relationship among the latent variables. The PLS-SEM analysis found that security has a positive impact on usability for the Single sign-on and bio-capsule facial authentication methods. We conclude that the facial authentication method was the most secure and usable among the three authentication methods. Further, descriptive analysis was done to draw out interesting findings from the survey regarding the observed variables.
CR · Dec 23, 2020
Enabling Secure and Effective Biomedical Data Sharing through Cyberinfrastructure Gateways
Shreya Goyal, Saptarshi Purkayastha, Tyler Phillips et al.
The Dynaswap project reports on developing a coherently integrated and trustworthy holistic secure workflow protection architecture for cyberinfrastructures, which can be used on virtual machines deployed through cyberinfrastructure (CI) services such as JetStream. This service creates a user-friendly cloud environment designed to give researchers access to interactive computing and data analysis resources on demand. The Dynaswap cybersecurity architecture supports roles, role hierarchies, and data hierarchies, as well as dynamic changes of roles and hierarchical relations within the scientific infrastructure. Dynaswap combines existing cutting-edge security frameworks (including an Authentication Authorization-Accounting framework, Multi-Factor Authentication, Secure Digital Provenance, and Blockchain) with advanced security tools (e.g., Biometric-Capsule, Cryptography-based Hierarchical Access Control, and Dual-level Key Management). The CI is being validated in life-science research environments and in Health Informatics education settings.
IV · Jun 23, 2020
Was there COVID-19 back in 2012? Challenge for AI in Diagnosis with Similar Indications
Imon Banerjee, Priyanshu Sinha, Saptarshi Purkayastha et al.
Purpose: Since the recent COVID-19 outbreak, there has been an avalanche of research papers applying deep learning based image processing to chest radiographs for detection of the disease. We test the performance of two top models for CXR COVID-19 diagnosis on external datasets to assess model generalizability. Methods: In this paper, we present our argument regarding the efficiency and applicability of existing deep learning models for COVID-19 diagnosis. We provide results from two popular models, COVID-Net and CoroNet, evaluated on three publicly available datasets and an additional institutional dataset collected from EMORY Hospital between January and May 2020, containing patients tested for COVID-19 infection using RT-PCR. Results: There is a large false positive rate (FPR) for COVID-Net on both the CheXpert (55.3%) and MIMIC-CXR (23.4%) datasets. On the EMORY dataset, COVID-Net has 61.4% sensitivity, a 0.54 F1-score, and a 0.49 precision value. The FPR of the CoroNet model is significantly lower than that of COVID-Net across all the datasets: EMORY (9.1%), CheXpert (1.3%), ChestX-ray14 (0.02%), MIMIC-CXR (0.06%). Conclusion: The models reported good to excellent performance on their internal datasets; however, we observed from our testing that their performance dramatically worsened on external data. This likely has several causes, including overfitting due to a lack of appropriate control patients and ground truth labels. The fourth, institutional dataset was labeled using RT-PCR, which could be positive without radiographic findings and vice versa. Therefore, a fusion model of both clinical and radiographic data may have better performance and generalization.
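As a quick consistency check on the COVID-Net figures reported for the EMORY dataset, recall that F1 is the harmonic mean of precision and sensitivity (recall). The short sketch below recomputes it from the two reported values. The function name is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and sensitivity reported for COVID-Net on the EMORY dataset.
emory_f1 = f1_score(precision=0.49, recall=0.614)
print(round(emory_f1, 3))  # 0.545
```

The recomputed value of about 0.545 is in line with the reported 0.54, allowing for rounding of the published precision and sensitivity.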
IV · Apr 16, 2020
A DICOM Framework for Machine Learning Pipelines against Real-Time Radiology Images
Pradeeban Kathiravelu, Puneet Sharma, Ashish Sharma et al.
Executing machine learning (ML) pipelines in real-time on radiology images is hard due to the limited computing resources in clinical environments and the lack of efficient data transfer capabilities to run them on research clusters. We propose Niffler, an integrated framework that enables the execution of ML pipelines at research clusters by efficiently querying and retrieving radiology images from the Picture Archiving and Communication Systems (PACS) of the hospitals. Niffler uses the Digital Imaging and Communications in Medicine (DICOM) protocol to fetch and store imaging data and provides metadata extraction capabilities and application programming interfaces (APIs) to apply filters on the images. Niffler further enables the sharing of the outcomes from the ML pipelines in a de-identified manner. Niffler has been running stably for more than 19 months and has supported several research projects at the department. In this paper, we present its architecture and three of its use cases: real-time detection of inferior vena cava (IVC) filters in the images, identification of scanner utilization, and scanner clock calibration. Evaluations on the Niffler prototype highlight its feasibility and efficiency in facilitating the ML pipelines on the images and metadata in real-time and retrospectively.
CL · Mar 17, 2020
Multi-label natural language processing to identify diagnosis and procedure codes from MIMIC-III inpatient notes
A. K. Bhavani Singh, Mounika Guntu, Ananth Reddy Bhimireddy et al.
In the United States, administrative costs that involve medical coding and billing services account for 25% of hospital spending, or more than 200 billion dollars. With the increasing number of patient records, manual code assignment is overwhelming, time-consuming, and error-prone, causing billing errors. Natural language processing can automate the extraction of codes/labels from unstructured clinical notes, which can help human coders save time, increase productivity, and verify medical coding errors. Our objective is to identify appropriate diagnosis and procedure codes from clinical notes by performing multi-label classification. We used de-identified data of critical care patients from the MIMIC-III database and subset the data to select the ten (top-10) and fifty (top-50) most common diagnoses and procedures, which cover 47.45% and 74.12% of all admissions, respectively. We fine-tuned a state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) language model on 80% of the data and validated it on the remaining 20%. The model achieved an overall accuracy of 87.08%, an F1 score of 85.82%, and an AUC of 91.76% for top-10 codes. For the top-50 codes, our model achieved an overall accuracy of 93.76%, an F1 score of 92.24%, and an AUC of 91%. When compared to previously published research, our model performs better at predicting codes from clinical text. We discuss approaches to generalize the knowledge discovery process of our MIMIC-BERT to other clinical notes. This can help human coders save time and prevent backlogs and additional costs due to coding errors.
CL · Dec 28, 2019
Natural language processing of MIMIC-III clinical notes for identifying diagnosis and procedures with neural networks
Siddhartha Nuthakki, Sunil Neela, Judy W. Gichoya et al.
Coding diagnoses and procedures in medical records is a crucial process in the healthcare industry, which includes the creation of accurate billings, receiving reimbursements from payers, and creating standardized patient care records. In the United States, billing and insurance-related activities cost around $471 billion in 2012, which constitutes about 25% of all U.S. hospital spending. In this paper, we report the performance of a natural language processing model that can map clinical notes to medical codes and predict the final diagnosis from unstructured entries of history of present illness, symptoms at the time of admission, etc. Previous studies have demonstrated that deep learning models perform better at such mapping than conventional machine learning models. Therefore, we employed a state-of-the-art deep learning method, ULMFiT, on the largest emergency department clinical notes dataset, MIMIC III, which has 1.2M clinical notes, to select for the top-10 and top-50 diagnosis and procedure codes. Our models were able to predict the top-10 diagnoses and procedures with 80.3% and 80.5% accuracy, respectively, whereas the top-50 ICD-9 codes of diagnoses and procedures are predicted with 70.7% and 63.9% accuracy. Prediction of diagnoses and procedures from unstructured clinical notes helps human coders save time, eliminate errors, and minimize costs. With promising scores from our present model, the next step would be to deploy it in a small-scale real-world scenario and compare it with human coders as the gold standard. We believe that further research on this approach can create highly accurate predictions that can ease the workflow in a clinical setting.