MIMIC-IV-Ext-PE: Using a large language model to predict pulmonary embolism phenotype in the MIMIC-IV dataset
This work addresses the need for large labeled datasets in hematologic research, specifically for PE, but is incremental as it applies an existing model to new data.
The researchers tackled the problem of limited large publicly available datasets for pulmonary embolism (PE) research by using a finetuned Bio_ClinicalBERT model (VTE-BERT) to automatically label PE from radiology reports in the MIMIC-IV dataset, achieving a sensitivity of 92.4% and PPV of 87.8% on 19,942 patients, and adding nearly 20,000 labels to the dataset.
Pulmonary embolism (PE) is a leading cause of preventable in-hospital mortality. Advances in diagnosis, risk stratification, and prevention can improve outcomes. There are few large publicly available datasets that contain PE labels for research. Using the MIMIC-IV database, we extracted all available radiology reports of computed tomography pulmonary angiography (CTPA) scans and two physicians manually labeled the results as PE positive (acute PE) or PE negative. We then applied a previously finetuned Bio_ClinicalBERT transformer language model, VTE-BERT, to extract labels automatically. We verified VTE-BERT's reliability by measuring its performance against manual adjudication. We also compared the performance of VTE-BERT to diagnosis codes. We found that VTE-BERT has a sensitivity of 92.4% and positive predictive value (PPV) of 87.8% on all 19,942 patients with CTPA radiology reports from the emergency room and/or hospital admission. In contrast, diagnosis codes have a sensitivity of 95.4% and PPV of 83.8% on the subset of 11,990 hospitalized patients with discharge diagnosis codes. We successfully add nearly 20,000 labels to CTPAs in a publicly available dataset and demonstrate the external validity of a semi-supervised language model in accelerating hematologic research.