AIJan 15Code
MHub.ai: A Simple, Standardized, and Reproducible Platform for AI Models in Medical ImagingLeonard Nürnberg, Dennis Bontempi, Suraj Pai et al.
Artificial intelligence (AI) has the potential to transform medical imaging by automating image analysis and accelerating clinical research. However, research and clinical use are limited by the wide variety of AI implementations and architectures, inconsistent documentation, and reproducibility issues. Here, we introduce MHub.ai, an open-source, container-based platform that standardizes access to AI models with minimal configuration, promoting accessibility and reproducibility in medical imaging. MHub.ai packages models from peer-reviewed publications into standardized containers that support direct processing of DICOM and other formats, provide a unified application interface, and embed structured metadata. Each model is accompanied by publicly available reference data that can be used to confirm model operation. MHub.ai includes an initial set of state-of-the-art segmentation, prediction, and feature extraction models for different modalities. The modular framework enables adaptation of any model and supports community contributions. We demonstrate the utility of the platform in a clinical use case through comparative evaluation of lung segmentation models. To further strengthen transparency and reproducibility, we publicly release the generated segmentations and evaluation metrics and provide interactive dashboards that allow readers to inspect individual cases and reproduce or extend our analysis. By simplifying model use, MHub.ai enables side-by-side benchmarking with identical execution commands and standardized outputs, and lowers the barrier to clinical translation.
CLAug 25, 2024
Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical DataFelix J. Dorfner, Amin Dada, Felix Busch et al.
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks. We evaluated their performance on clinical case challenges from the New England Journal of Medicine (NEJM) and the Journal of the American Medical Association (JAMA) and on several clinical tasks (e.g., information extraction, document summarization, and clinical coding). Using benchmarks specifically chosen to be likely outside the fine-tuning datasets of biomedical models, we found that biomedical LLMs mostly perform inferior to their general-purpose counterparts, especially on tasks not focused on medical knowledge. While larger models showed similar performance on case tasks (e.g., OpenBioLLM-70B: 66.4% vs. Llama-3-70B-Instruct: 65% on JAMA cases), smaller biomedical models showed more pronounced underperformance (e.g., OpenBioLLM-8B: 30% vs. Llama-3-8B-Instruct: 64.3% on NEJM cases). Similar trends were observed across the CLUE (Clinical Language Understanding Evaluation) benchmark tasks, with general-purpose models often performing better on text generation, question answering, and coding tasks. Our results suggest that fine-tuning LLMs to biomedical data may not provide the expected benefits and may potentially lead to reduced performance, challenging prevailing assumptions about domain-specific adaptation of LLMs and highlighting the need for more rigorous evaluation frameworks in healthcare AI. Alternative approaches, such as retrieval-augmented generation, may be more effective in enhancing the biomedical capabilities of LLMs without compromising their general knowledge.
CLFeb 19, 2024Code
Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray ReportsFelix J. Dorfner, Liv Jürgensen, Leonhard Donle et al.
Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models. Materials and Methods: Two different and independent datasets were used. The first dataset consists of 540 chest x-ray reports that were created at the Massachusetts General Hospital between July 2019 and July 2021. The second dataset consists of 500 chest x-ray reports from the ImaGenome dataset. We then compared the commercial models GPT-3.5 Turbo and GPT-4 from OpenAI to the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, QWEN1.5-72B and CheXbert and CheXpert-labeler in their ability to accurately label the presence of multiple findings in x-ray text reports using different prompting techniques. Results: On the ImaGenome dataset, the best performing open-source model was Llama2-70B with micro F1-scores of 0.972 and 0.970 for zero- and few-shot prompts, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively. On the institutional dataset, the best performing open-source model was QWEN1.5-72B with micro F1-scores of 0.952 and 0.965 for zero- and few-shot prompting, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively. Conclusion: In this paper, we show that while GPT-4 is superior to open-source models in zero-shot report labeling, the implementation of few-shot prompting can bring open-source models on par with GPT-4. This shows that open-source models could be a performant and privacy preserving alternative to GPT-4 for the task of radiology report classification.
IVMay 10, 2024Code
MRSegmentator: Multi-Modality Segmentation of 40 Classes in MRI and CTHartmut Häntze, Lina Xu, Christian J. Mertens et al.
Purpose: To develop and evaluate a deep learning model for multi-organ segmentation of MRI scans. Materials and Methods: The model was trained on 1,200 manually annotated 3D axial MRI scans from the UK Biobank, 221 in-house MRI scans, and 1228 CT scans from the TotalSegmentator dataset. A human-in-the-loop annotation workflow was employed, leveraging cross-modality transfer learning from an existing CT segmentation model to segment 40 anatomical structures. The annotation process began with a model based on transfer learning between CT and MR, which was iteratively refined based on manual corrections to predicted segmentations. The model's performance was evaluated on MRI examinations obtained from the German National Cohort (NAKO) study (n=900) from the AMOS22 dataset (n=60) and from the TotalSegmentator-MRI test data (n=29). The Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation quality, stratified by organ and scan type. The model and its weights will be open-sourced. Results: MRSegmentator demonstrated high accuracy for well-defined organs (lungs: DSC 0.96, heart: DSC 0.94) and organs with anatomic variability (liver: DSC 0.96, kidneys: DSC 0.95). Smaller structures showed lower accuracy (portal/splenic veins: DSC 0.64, adrenal glands: DSC 0.69). On external validation using NAKO data, mean DSC ranged from 0.85 $\pm$ 0.08 for T2-HASTE to 0.91 $\pm$ 0.05 for in-phase sequences. The model generalized well to CT, achieving mean DSC of 0.84 $\pm$ 0.11 on AMOS CT data. Conclusion: MRSegmentator accurately segments 40 anatomical structures in MRI across diverse datasets and imaging protocols, with additional generalizability to CT images. This open-source model will provide a valuable tool for automated multi-organ segmentation in medical imaging research. It can be downloaded from https://github.com/hhaentze/MRSegmentator.
CVDec 15, 2025
DBT-DINO: Towards Foundation model based analysis of Digital Breast TomosynthesisFelix J. Dorfner, Manon A. Dorster, Ryan Connolly et al.
Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its use for breast cancer screening. To develop and evaluate a foundation model for DBT (DBT-DINO) across multiple clinical tasks and assess the impact of domain-specific pre-training. Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year risk of developing breast cancer using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes. For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95\% CI: 0.76--0.81), outperforming both the MetaAI DINOv2 baseline (0.73, 95\% CI: 0.70--0.76, p<.001) and DenseNet-121 (0.74, 95\% CI: 0.71--0.76, p<.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95\% CI: 0.76--0.80) compared to DINOv2's 0.76 (95\% CI: 0.74--0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95\% CI: 0.60--0.74) compared to DBT-DINO with 0.62 (95\% CI: 0.53--0.71, p=.60). DBT-DINO demonstrated better performance on cancerous lesions specifically with a detection rate of 78.8\% compared to Dinov2's 77.3\%. Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with ImageNet baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development.
CVMay 12, 2024Code
Incorporating Anatomical Awareness for Enhanced Generalizability and Progression Prediction in Deep Learning-Based Radiographic Sacroiliitis DetectionFelix J. Dorfner, Janis L. Vahldiek, Leonhard Donle et al.
Purpose: To examine whether incorporating anatomical awareness into a deep learning model can improve generalizability and enable prediction of disease progression. Methods: This retrospective multicenter study included conventional pelvic radiographs of 4 different patient cohorts focusing on axial spondyloarthritis (axSpA) collected at university and community hospitals. The first cohort, which consisted of 1483 radiographs, was split into training (n=1261) and validation (n=222) sets. The other cohorts comprising 436, 340, and 163 patients, respectively, were used as independent test datasets. For the second cohort, follow-up data of 311 patients was used to examine progression prediction capabilities. Two neural networks were trained, one on images cropped to the bounding box of the sacroiliac joints (anatomy-aware) and the other one on full radiographs. The performance of the models was compared using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity. Results: On the three test datasets, the standard model achieved AUC scores of 0.853, 0.817, 0.947, with an accuracy of 0.770, 0.724, 0.850. Whereas the anatomy-aware model achieved AUC scores of 0.899, 0.846, 0.957, with an accuracy of 0.821, 0.744, 0.906, respectively. The patients who were identified as high risk by the anatomy aware model had an odds ratio of 2.16 (95% CI: 1.19, 3.86) for having progression of radiographic sacroiliitis within 2 years. Conclusion: Anatomical awareness can improve the generalizability of a deep learning model in detecting radiographic sacroiliitis. The model is published as fully open source alongside this study.
IVMay 4, 2024
Improve Cross-Modality Segmentation by Treating T1-Weighted MRI Images as Inverted CT ScansHartmut Häntze, Lina Xu, Maximilian Rattunde et al.
Computed tomography (CT) segmentation models often contain classes that are not currently supported by magnetic resonance imaging (MRI) segmentation models. In this study, we show that a simple image inversion technique can significantly improve the segmentation quality of CT segmentation models on MRI data. We demonstrate the feasibility for both a general multi-class and a specific renal carcinoma model for segmenting T1-weighted MRI images. Using this technique, we were able to localize and segment clear cell renal cell carcinoma in T1-weighted MRI scans, using a model that was trained on only CT data. Image inversion is straightforward to implement and does not require dedicated graphics processing units, thus providing a quick alternative to complex deep modality-transfer models. Our results demonstrate that existing CT models, including pathology models, might be transferable to the MRI domain with reasonable effort.
IVAug 31, 2025
Towards Early Detection: AI-Based Five-Year Forecasting of Breast Cancer Risk Using Digital Breast Tomosynthesis ImagingManon A. Dorster, Felix J. Dorfner, Mason C. Cleveland et al.
As early detection of breast cancer strongly favors successful therapeutic outcomes, there is major commercial interest in optimizing breast cancer screening. However, current risk prediction models achieve modest performance and do not incorporate digital breast tomosynthesis (DBT) imaging, which was FDA-approved for breast cancer screening in 2011. To address this unmet need, we present a deep learning (DL)-based framework capable of forecasting an individual patient's 5-year breast cancer risk directly from screening DBT. Using an unparalleled dataset of 161,753 DBT examinations from 50,590 patients, we trained a risk predictor based on features extracted using the Meta AI DINOv2 image encoder, combined with a cumulative hazard layer, to assess a patient's likelihood of developing breast cancer over five years. On a held-out test set, our best-performing model achieved an AUROC of 0.80 on predictions within 5 years. These findings reveal the high potential of DBT-based DL approaches to complement traditional risk assessment tools, and serve as a promising basis for additional investigation to validate and enhance our work.
IVOct 17, 2024
Scalable Drift Monitoring in Medical Imaging AIJameson Merkow, Felix J. Dorfner, Xiyu Yang et al.
The integration of artificial intelligence (AI) into medical imaging has advanced clinical diagnostics but poses challenges in managing model drift and ensuring long-term reliability. To address these challenges, we develop MMC+, an enhanced framework for scalable drift monitoring, building upon the CheXstray framework that introduced real-time drift detection for medical imaging AI models using multi-modal data concordance. This work extends the original framework's methodologies, providing a more scalable and adaptable solution for real-world healthcare settings and offers a reliable and cost-effective alternative to continuous performance monitoring addressing limitations of both continuous and periodic monitoring methods. MMC+ introduces critical improvements to the original framework, including more robust handling of diverse data streams, improved scalability with the integration of foundation models like MedImageInsight for high-dimensional image embeddings without site-specific training, and the introduction of uncertainty bounds to better capture drift in dynamic clinical environments. Validated with real-world data from Massachusetts General Hospital during the COVID-19 pandemic, MMC+ effectively detects significant data shifts and correlates them with model performance changes. While not directly predicting performance degradation, MMC+ serves as an early warning system, indicating when AI systems may deviate from acceptable performance bounds and enabling timely interventions. By emphasizing the importance of monitoring diverse data streams and evaluating data shifts alongside model performance, this work contributes to the broader adoption and integration of AI solutions in clinical settings.