CLApr 14, 2023Code
MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training DataTianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou et al.
As large language models (LLMs) like OpenAI's GPT series continue to make strides, we witness the emergence of artificial intelligence applications in an ever-expanding range of fields. In medicine, these LLMs hold considerable promise for improving medical workflows, diagnostics, patient care, and education. Yet, there is an urgent need for open-source models that can be deployed on-premises to safeguard patient privacy. In our work, we present an innovative dataset consisting of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications. We investigate the impact of fine-tuning these datasets on publicly accessible pre-trained LLMs, and subsequently, we juxtapose the performance of pre-trained-only models against the fine-tuned models concerning the examinations that future medical doctors must pass to achieve certification.
CVSep 27, 2022
What Does DALL-E 2 Know About Radiology?Lisa C. Adams, Felix Busch, Daniel Truhn et al.
Generative models such as DALL-E 2 could represent a promising future tool for image generation, augmentation, and manipulation for artificial intelligence research in radiology provided that these models have sufficient medical domain knowledge. Here we show that DALL-E 2 has learned relevant representations of X-ray images with promising capabilities in terms of zero-shot text-to-image generation of new images, continuation of an image beyond its original boundaries, or removal of elements, while pathology generation or CT, MRI, and ultrasound images are still limited. The use of generative models for augmenting and generating radiological data thus seems feasible, even if further fine-tuning and adaptation of these models to the respective domain is required beforehand.
CLMar 14, 2023
MEDBERT.de: A Comprehensive German BERT Model for the Medical DomainKeno K. Bressem, Jens-Michalis Papaioannou, Paul Grundmann et al.
This paper presents medBERTde, a pre-trained German BERT model specifically designed for the German medical domain. The model has been trained on a large corpus of 4.7 Million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the overall performance of the model, this paper also conducts a more in-depth analysis of its capabilities. We investigate the impact of data deduplication on the model's performance, as well as the potential benefits of using more efficient tokenization methods. Our results indicate that domain-specific models such as medBERTde are particularly useful for longer texts, and that deduplication of training data does not necessarily lead to improved performance. Furthermore, we found that efficient tokenization plays only a minor role in improving model performance, and attribute most of the improved performance to the large amount of training data. To encourage further research, the pre-trained model weights and new benchmarks based on radiological data are made publicly available for use by the scientific community.
CLAug 25, 2024
Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical DataFelix J. Dorfner, Amin Dada, Felix Busch et al.
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks. We evaluated their performance on clinical case challenges from the New England Journal of Medicine (NEJM) and the Journal of the American Medical Association (JAMA) and on several clinical tasks (e.g., information extraction, document summarization, and clinical coding). Using benchmarks specifically chosen to be likely outside the fine-tuning datasets of biomedical models, we found that biomedical LLMs mostly perform inferior to their general-purpose counterparts, especially on tasks not focused on medical knowledge. While larger models showed similar performance on case tasks (e.g., OpenBioLLM-70B: 66.4% vs. Llama-3-70B-Instruct: 65% on JAMA cases), smaller biomedical models showed more pronounced underperformance (e.g., OpenBioLLM-8B: 30% vs. Llama-3-8B-Instruct: 64.3% on NEJM cases). Similar trends were observed across the CLUE (Clinical Language Understanding Evaluation) benchmark tasks, with general-purpose models often performing better on text generation, question answering, and coding tasks. Our results suggest that fine-tuning LLMs to biomedical data may not provide the expected benefits and may potentially lead to reduced performance, challenging prevailing assumptions about domain-specific adaptation of LLMs and highlighting the need for more rigorous evaluation frameworks in healthcare AI. Alternative approaches, such as retrieval-augmented generation, may be more effective in enhancing the biomedical capabilities of LLMs without compromising their general knowledge.
LGSep 29, 2023
Medical Foundation Models are Susceptible to Targeted Misinformation AttacksTianyu Han, Sven Nebelung, Firas Khader et al.
Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. In this study, we demonstrate a concerning vulnerability of LLMs in medicine. Through targeted manipulation of just 1.1% of the model's weights, we can deliberately inject an incorrect biomedical fact. The erroneous information is then propagated in the model's output, whilst its performance on other biomedical tasks remains intact. We validate our findings in a set of 1,038 incorrect biomedical facts. This peculiar susceptibility raises serious security and trustworthiness concerns for the application of LLMs in healthcare settings. It accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.
CLFeb 19, 2024Code
Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray ReportsFelix J. Dorfner, Liv Jürgensen, Leonhard Donle et al.
Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models. Materials and Methods: Two different and independent datasets were used. The first dataset consists of 540 chest x-ray reports that were created at the Massachusetts General Hospital between July 2019 and July 2021. The second dataset consists of 500 chest x-ray reports from the ImaGenome dataset. We then compared the commercial models GPT-3.5 Turbo and GPT-4 from OpenAI to the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, QWEN1.5-72B and CheXbert and CheXpert-labeler in their ability to accurately label the presence of multiple findings in x-ray text reports using different prompting techniques. Results: On the ImaGenome dataset, the best performing open-source model was Llama2-70B with micro F1-scores of 0.972 and 0.970 for zero- and few-shot prompts, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively. On the institutional dataset, the best performing open-source model was QWEN1.5-72B with micro F1-scores of 0.952 and 0.965 for zero- and few-shot prompting, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively. Conclusion: In this paper, we show that while GPT-4 is superior to open-source models in zero-shot report labeling, the implementation of few-shot prompting can bring open-source models on par with GPT-4. This shows that open-source models could be a performant and privacy preserving alternative to GPT-4 for the task of radiology report classification.
IVMay 10, 2024Code
MRSegmentator: Multi-Modality Segmentation of 40 Classes in MRI and CTHartmut Häntze, Lina Xu, Christian J. Mertens et al.
Purpose: To develop and evaluate a deep learning model for multi-organ segmentation of MRI scans. Materials and Methods: The model was trained on 1,200 manually annotated 3D axial MRI scans from the UK Biobank, 221 in-house MRI scans, and 1228 CT scans from the TotalSegmentator dataset. A human-in-the-loop annotation workflow was employed, leveraging cross-modality transfer learning from an existing CT segmentation model to segment 40 anatomical structures. The annotation process began with a model based on transfer learning between CT and MR, which was iteratively refined based on manual corrections to predicted segmentations. The model's performance was evaluated on MRI examinations obtained from the German National Cohort (NAKO) study (n=900) from the AMOS22 dataset (n=60) and from the TotalSegmentator-MRI test data (n=29). The Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation quality, stratified by organ and scan type. The model and its weights will be open-sourced. Results: MRSegmentator demonstrated high accuracy for well-defined organs (lungs: DSC 0.96, heart: DSC 0.94) and organs with anatomic variability (liver: DSC 0.96, kidneys: DSC 0.95). Smaller structures showed lower accuracy (portal/splenic veins: DSC 0.64, adrenal glands: DSC 0.69). On external validation using NAKO data, mean DSC ranged from 0.85 $\pm$ 0.08 for T2-HASTE to 0.91 $\pm$ 0.05 for in-phase sequences. The model generalized well to CT, achieving mean DSC of 0.84 $\pm$ 0.11 on AMOS CT data. Conclusion: MRSegmentator accurately segments 40 anatomical structures in MRI across diverse datasets and imaging protocols, with additional generalizability to CT images. This open-source model will provide a valuable tool for automated multi-organ segmentation in medical imaging research. It can be downloaded from https://github.com/hhaentze/MRSegmentator.
CVMay 12, 2025Code
Robust Kidney Abnormality Segmentation: A Validation Study of an AI-Based FrameworkSarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh et al.
Kidney abnormality segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and abnormalities, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated kidney abnormality segmentation algorithm, made publicly available for clinical and research use. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
CVMay 12, 2024Code
Incorporating Anatomical Awareness for Enhanced Generalizability and Progression Prediction in Deep Learning-Based Radiographic Sacroiliitis DetectionFelix J. Dorfner, Janis L. Vahldiek, Leonhard Donle et al.
Purpose: To examine whether incorporating anatomical awareness into a deep learning model can improve generalizability and enable prediction of disease progression. Methods: This retrospective multicenter study included conventional pelvic radiographs of 4 different patient cohorts focusing on axial spondyloarthritis (axSpA) collected at university and community hospitals. The first cohort, which consisted of 1483 radiographs, was split into training (n=1261) and validation (n=222) sets. The other cohorts comprising 436, 340, and 163 patients, respectively, were used as independent test datasets. For the second cohort, follow-up data of 311 patients was used to examine progression prediction capabilities. Two neural networks were trained, one on images cropped to the bounding box of the sacroiliac joints (anatomy-aware) and the other one on full radiographs. The performance of the models was compared using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity. Results: On the three test datasets, the standard model achieved AUC scores of 0.853, 0.817, 0.947, with an accuracy of 0.770, 0.724, 0.850. Whereas the anatomy-aware model achieved AUC scores of 0.899, 0.846, 0.957, with an accuracy of 0.821, 0.744, 0.906, respectively. The patients who were identified as high risk by the anatomy aware model had an odds ratio of 2.16 (95% CI: 1.19, 3.86) for having progression of radiographic sacroiliitis within 2 years. Conclusion: Anatomical awareness can improve the generalizability of a deep learning model in detecting radiographic sacroiliitis. The model is published as fully open source alongside this study.
CLOct 13, 2025Code
Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal PerturbationsJohannes Moll, Markus Graf, Tristan Lemke et al.
Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $τ_b=0.670$), moderate alignment for fidelity ($τ_b=0.387$), and weak alignment for confidence tone ($τ_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality can be decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
86.2AIApr 27
Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensusJohannes Moll, Jannik Lübberstedt, Christoph Nuernbergk et al.
Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.
CLMay 30, 2025
Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMsJuraj Vladika, Annika Domres, Mai Nguyen et al.
Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.
IVMay 4, 2024
Improve Cross-Modality Segmentation by Treating T1-Weighted MRI Images as Inverted CT ScansHartmut Häntze, Lina Xu, Maximilian Rattunde et al.
Computed tomography (CT) segmentation models often contain classes that are not currently supported by magnetic resonance imaging (MRI) segmentation models. In this study, we show that a simple image inversion technique can significantly improve the segmentation quality of CT segmentation models on MRI data. We demonstrate the feasibility for both a general multi-class and a specific renal carcinoma model for segmenting T1-weighted MRI images. Using this technique, we were able to localize and segment clear cell renal cell carcinoma in T1-weighted MRI scans, using a model that was trained on only CT data. Image inversion is straightforward to implement and does not require dedicated graphics processing units, thus providing a quick alternative to complex deep modality-transfer models. Our results demonstrate that existing CT models, including pathology models, might be transferable to the MRI domain with reasonable effort.
CVSep 24, 2025
Sex-based Bias Inherent in the Dice Similarity Coefficient: A Model Independent Analysis for Multiple Anatomical StructuresHartmut Häntze, Myrthe Buser, Alessa Hering et al.
Overlap-based metrics such as the Dice Similarity Coefficient (DSC) penalize segmentation errors more heavily in smaller structures. As organ size differs by sex, this implies that a segmentation error of equal magnitude may result in lower DSCs in women due to their smaller average organ volumes compared to men. While previous work has examined sex-based differences in models or datasets, no study has yet investigated the potential bias introduced by the DSC itself. This study quantifies sex-based differences of the DSC and the normalized DSC in an idealized setting independent of specific models. We applied equally-sized synthetic errors to manual MRI annotations from 50 participants to ensure sex-based comparability. Even minimal errors (e.g., a 1 mm boundary shift) produced systematic DSC differences between sexes. For small structures, average DSC differences were around 0.03; for medium-sized structures around 0.01. Only large structures (i.e., lungs and liver) were mostly unaffected, with sex-based DSC differences close to zero. These findings underline that fairness studies using the DSC as an evaluation metric should not expect identical scores between men and women, as the metric itself introduces bias. A segmentation model may perform equally well across sexes in terms of error magnitude, even if observed DSC values suggest otherwise. Importantly, our work raises awareness of a previously underexplored source of sex-based differences in segmentation performance. One that arises not from model behavior, but from the metric itself. Recognizing this factor is essential for more accurate and fair evaluations in medical image analysis.
IVJan 25, 2021
3D U-Net for segmentation of COVID-19 associated pulmonary infiltrates using transfer learning: State-of-the-art results on affordable hardwareKeno K. Bressem, Stefan M. Niehues, Bernd Hamm et al.
Segmentation of pulmonary infiltrates can help assess severity of COVID-19, but manual segmentation is labor and time-intensive. Using neural networks to segment pulmonary infiltrates would enable automation of this task. However, training a 3D U-Net from computed tomography (CT) data is time- and resource-intensive. In this work, we therefore developed and tested a solution on how transfer learning can be used to train state-of-the-art segmentation models on limited hardware and in shorter time. We use the recently published RSNA International COVID-19 Open Radiology Database (RICORD) to train a fully three-dimensional U-Net architecture using an 18-layer 3D ResNet, pretrained on the Kinetics-400 dataset as encoder. The generalization of the model was then tested on two openly available datasets of patients with COVID-19, who received chest CTs (Corona Cases and MosMed datasets). Our model performed comparable to previously published 3D U-Net architectures, achieving a mean Dice score of 0.679 on the tuning dataset, 0.648 on the Coronacases dataset and 0.405 on the MosMed dataset. Notably, these results were achieved with shorter training time on a single GPU with less memory available than the GPUs used in previous studies.
LGFeb 20, 2020
Comparing Different Deep Learning Architectures for Classification of Chest RadiographsKeno K. Bressem, Lisa Adams, Christoph Erxleben et al.
Chest radiographs are among the most frequently acquired images in radiology and are often the subject of computer vision research. However, most of the models used to classify chest radiographs are derived from openly available deep neural networks, trained on large image-datasets. These datasets routinely differ from chest radiographs in that they are mostly color images and contain several possible image classes, while radiographs are greyscale images and often only contain fewer image classes. Therefore, very deep neural networks, which can represent more complex relationships in image-features, might not be required for the comparatively simpler task of classifying grayscale chest radiographs. We compared fifteen different architectures of artificial neural networks regarding training-time and performance on the openly available CheXpert dataset to identify the most suitable models for deep learning tasks on chest radiographs. We could show, that smaller networks such as ResNet-34, AlexNet or VGG-16 have the potential to classify chest radiographs as precisely as deeper neural networks such as DenseNet-201 or ResNet-151, while being less computationally demanding.