LGJun 28, 2024Code
ACES: Automatic Cohort Extraction System for Event-Stream DatasetsJustin Xu, Jack Gallifant, Alistair E. W. Johnson et al.
Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task or cohort definitions are often private in this field, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. We address a significant part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed to simultaneously simplify the development of tasks and cohorts for ML in healthcare and also enable their reproduction, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides: (1) a highly intuitive and expressive domain-specific configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion or exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or Event Stream GPT (ESGPT) formats, or to *any* dataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks in representation learning, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies using this modality. ACES is available at: https://github.com/justin13601/aces.
CLJun 12, 2025
BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLPThomas Sounack, Joshua Davis, Brigitte Durieux et al.
Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.
CVJan 21, 2019
MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographsAlistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum et al.
Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's thorax, but requiring specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. However, a key challenge in the development of these techniques is the lack of sufficient data. Here we describe MIMIC-CXR-JPG v2.0.0, a large dataset of 377,110 chest x-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 - 2016. Images are provided with 14 labels derived from two natural language processing tools applied to the corresponding free-text radiology reports. MIMIC-CXR-JPG is derived entirely from the MIMIC-CXR database, and aims to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. All images have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in medical computer vision.
LGDec 6, 2018
Generalizability of predictive models for intensive care unit patientsAlistair E. W. Johnson, Tom J. Pollard, Tristan Naumann
A large volume of research has considered the creation of predictive models for clinical data; however, much existing literature reports results using only a single source of data. In this work, we evaluate the performance of models trained on the publicly-available eICU Collaborative Research Database. We show that cross-validation using many distinct centers provides a reasonable estimate of model performance in new centers. We further show that a single model trained across centers transfers well to distinct hospitals, even compared to a model retrained using hospital-specific data. Our results motivate the use of multi-center datasets for model development and highlight the need for data sharing among hospitals to maximize model performance.
LGSep 11, 2017
False arrhythmia alarm reduction in the intensive care unitAndrea S. Li, Alistair E. W. Johnson, Roger G. Mark
Research has shown that false alarms constitute more than 80% of the alarms triggered in the intensive care unit (ICU). The high false arrhythmia alarm rate has severe implications such as disruption of patient care, caregiver alarm fatigue, and desensitization from clinical staff to real life-threatening alarms. A method to reduce the false alarm rate would therefore greatly benefit patients as well as nurses in their ability to provide care. We here develop and describe a robust false arrhythmia alarm reduction system for use in the ICU. Building off of work previously described in the literature, we make use of signal processing and machine learning techniques to identify true and false alarms for five arrhythmia types. This baseline algorithm alone is able to perform remarkably well, with a sensitivity of 0.908, a specificity of 0.838, and a PhysioNet/CinC challenge score of 0.756. We additionally explore dynamic time warping techniques on both the entire alarm signal as well as on a beat-by-beat basis in an effort to improve performance of ventricular tachycardia, which has in the literature been one of the hardest arrhythmias to classify. Such an algorithm with strong performance and efficiency could potentially be translated for use in the ICU to promote overall patient care and recovery.