CLApr 14, 2023Code
MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training DataTianyu Han, Lisa C. Adams, Jens-Michalis Papaioannou et al.
As large language models (LLMs) like OpenAI's GPT series continue to make strides, we witness the emergence of artificial intelligence applications in an ever-expanding range of fields. In medicine, these LLMs hold considerable promise for improving medical workflows, diagnostics, patient care, and education. Yet, there is an urgent need for open-source models that can be deployed on-premises to safeguard patient privacy. In our work, we present an innovative dataset consisting of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications. We investigate the impact of fine-tuning these datasets on publicly accessible pre-trained LLMs, and subsequently, we juxtapose the performance of pre-trained-only models against the fine-tuned models concerning the examinations that future medical doctors must pass to achieve certification.
IVNov 7, 2022Code
Medical Diffusion: Denoising Diffusion Probabilistic Models for 3D Medical Image GenerationFiras Khader, Gustav Mueller-Franzes, Soroosh Tayebi Arasteh et al.
Recent advances in computer vision have shown promising results in image generation. Diffusion probabilistic models in particular have generated realistic images from textual input, as demonstrated by DALL-E 2, Imagen and Stable Diffusion. However, their use in medicine, where image data typically comprises three-dimensional volumes, has not been systematically evaluated. Synthetic images may play a crucial role in privacy preserving artificial intelligence and can also be used to augment small datasets. Here we show that diffusion probabilistic models can synthesize high quality medical imaging data, which we show for Magnetic Resonance Images (MRI) and Computed Tomography (CT) images. We provide quantitative measurements of their performance through a reader study with two medical experts who rated the quality of the synthesized images in three categories: Realistic image appearance, anatomical correctness and consistency between slices. Furthermore, we demonstrate that synthetic images can be used in a self-supervised pre-training and improve the performance of breast segmentation models when data is scarce (dice score 0.91 vs. 0.95 without vs. with synthetic data). The code is publicly available on GitHub: https://github.com/FirasGit/medicaldiffusion.
IVDec 14, 2022
Diffusion Probabilistic Models beat GANs on Medical ImagesGustav Müller-Franzes, Jan Moritz Niehues, Firas Khader et al.
The success of Deep Learning applications critically depends on the quality and scale of the underlying training data. Generative adversarial networks (GANs) can generate arbitrary large datasets, but diversity and fidelity are limited, which has recently been addressed by denoising diffusion probabilistic models (DDPMs) whose superiority has been demonstrated on natural images. In this study, we propose Medfusion, a conditional latent DDPM for medical images. We compare our DDPM-based model against GAN-based models, which constitute the current state-of-the-art in the medical domain. Medfusion was trained and compared with (i) StyleGan-3 on n=101,442 images from the AIROGS challenge dataset to generate fundoscopies with and without glaucoma, (ii) ProGAN on n=191,027 from the CheXpert dataset to generate radiographs with and without cardiomegaly and (iii) wGAN on n=19,557 images from the CRCMS dataset to generate histopathological images with and without microsatellite stability. In the AIROGS, CRMCS, and CheXpert datasets, Medfusion achieved lower (=better) FID than the GANs (11.63 versus 20.43, 30.03 versus 49.26, and 17.28 versus 84.31). Also, fidelity (precision) and diversity (recall) were higher (=better) for Medfusion in all three datasets. Our study shows that DDPM are a superior alternative to GANs for image synthesis in the medical domain.
LGAug 27, 2023
Large Language Models Streamline Automated Machine Learning for Clinical StudiesSoroosh Tayebi Arasteh, Tianyu Han, Mahshad Lotfinia et al.
A knowledge gap persists between machine learning (ML) developers (e.g., data scientists) and practitioners (e.g., clinicians), hampering the full utilization of ML for clinical data analysis. We investigated the potential of the ChatGPT Advanced Data Analysis (ADA), an extension of GPT-4, to bridge this gap and perform ML analyses efficiently. Real-world clinical datasets and study details from large trials across various medical specialties were presented to ChatGPT ADA without specific guidance. ChatGPT ADA autonomously developed state-of-the-art ML models based on the original study's training data to predict clinical outcomes such as cancer development, cancer progression, disease complications, or biomarkers such as pathogenic gene sequences. Following the re-implementation and optimization of the published models, the head-to-head comparison of the ChatGPT ADA-crafted ML models and their respective manually crafted counterparts revealed no significant differences in traditional performance metrics (P>0.071). Strikingly, the ChatGPT ADA-crafted ML models often outperformed their counterparts. In conclusion, ChatGPT ADA offers a promising avenue to democratize ML in medicine by simplifying complex data analyses, yet should enhance, not replace, specialized training and resources, to promote broader applications in medical research and practice.
CLAug 25, 2024
Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical DataFelix J. Dorfner, Amin Dada, Felix Busch et al.
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks. We evaluated their performance on clinical case challenges from the New England Journal of Medicine (NEJM) and the Journal of the American Medical Association (JAMA) and on several clinical tasks (e.g., information extraction, document summarization, and clinical coding). Using benchmarks specifically chosen to be likely outside the fine-tuning datasets of biomedical models, we found that biomedical LLMs mostly perform inferior to their general-purpose counterparts, especially on tasks not focused on medical knowledge. While larger models showed similar performance on case tasks (e.g., OpenBioLLM-70B: 66.4% vs. Llama-3-70B-Instruct: 65% on JAMA cases), smaller biomedical models showed more pronounced underperformance (e.g., OpenBioLLM-8B: 30% vs. Llama-3-8B-Instruct: 64.3% on NEJM cases). Similar trends were observed across the CLUE (Clinical Language Understanding Evaluation) benchmark tasks, with general-purpose models often performing better on text generation, question answering, and coding tasks. Our results suggest that fine-tuning LLMs to biomedical data may not provide the expected benefits and may potentially lead to reduced performance, challenging prevailing assumptions about domain-specific adaptation of LLMs and highlighting the need for more rigorous evaluation frameworks in healthcare AI. Alternative approaches, such as retrieval-augmented generation, may be more effective in enhancing the biomedical capabilities of LLMs without compromising their general knowledge.
IVNov 24, 2023
From Text to Image: Exploring GPT-4Vision's Potential in Advanced Radiological Analysis across SubspecialtiesFelix Busch, Tianyu Han, Marcus Makowski et al.
The study evaluates and compares GPT-4 and GPT-4Vision for radiological tasks, suggesting GPT-4Vision may recognize radiological features from images, thereby enhancing its diagnostic potential over text-based descriptions.
CVJan 15Code
Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image SegmentationChongcong Jiang, Tianxingjian Ding, Chuhan Song et al.
Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept-based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial prompts, and the need to reason over complex anatomical and volumetric structures. Here we present Medical SAM3, a foundation model for universal prompt-driven medical image segmentation, obtained by fully fine-tuning SAM3 on large-scale, heterogeneous 2D and 3D medical imaging datasets with paired segmentation masks and text prompts. Through a systematic analysis of vanilla SAM3, we observe that its performance degrades substantially on medical data, with its apparent competitiveness largely relying on strong geometric priors such as ground-truth-derived bounding boxes. These findings motivate full model adaptation beyond prompt engineering alone. By fine-tuning SAM3's model parameters on 33 datasets spanning 10 medical imaging modalities, Medical SAM3 acquires robust domain-specific representations while preserving prompt-driven flexibility. Extensive experiments across organs, imaging modalities, and dimensionalities demonstrate consistent and significant performance gains, particularly in challenging scenarios characterized by semantic ambiguity, complex morphology, and long-range 3D context. Our results establish Medical SAM3 as a universal, text-guided segmentation foundation model for medical imaging and highlight the importance of holistic model adaptation for achieving robust prompt-driven segmentation under severe domain shift. Code and model will be made available at https://github.com/AIM-Research-Lab/Medical-SAM3.
LGDec 18, 2022
Medical Diagnosis with Large Scale Multimodal Transformers: Leveraging Diverse Data for More Accurate DiagnosisFiras Khader, Gustav Mueller-Franzes, Tianci Wang et al.
Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data. However, these models suffer from scaling issues: they have to learn pairwise interactions between each piece of information in each data type, thereby escalating model complexity beyond manageable scales. This has so far precluded a widespread use of multimodal deep learning. Here, we present a new technical approach of "learnable synergies", in which the model only selects relevant interactions between data modalities and keeps an "internal memory" of relevant data. Our approach is easily scalable and naturally adapts to multimodal data inputs from clinical routine. We demonstrate this approach on three large multimodal datasets from radiology and ophthalmology and show that it outperforms state-of-the-art models in clinically relevant diagnosis tasks. Our new approach is transferable and will allow the application of multimodal deep learning to a broad set of clinically relevant problems.
LGSep 29, 2023
Medical Foundation Models are Susceptible to Targeted Misinformation AttacksTianyu Han, Sven Nebelung, Firas Khader et al.
Large language models (LLMs) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. In this study, we demonstrate a concerning vulnerability of LLMs in medicine. Through targeted manipulation of just 1.1% of the model's weights, we can deliberately inject an incorrect biomedical fact. The erroneous information is then propagated in the model's output, whilst its performance on other biomedical tasks remains intact. We validate our findings in a set of 1,038 incorrect biomedical facts. This peculiar susceptibility raises serious security and trustworthiness concerns for the application of LLMs in healthcare settings. It accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.
CVSep 29, 2023
Reconstruction of Patient-Specific Confounders in AI-based Radiologic Image Interpretation using Generative PretrainingTianyu Han, Laura Žigutytė, Luisa Huck et al.
Detecting misleading patterns in automated diagnostic assistance systems, such as those powered by Artificial Intelligence, is critical to ensuring their reliability, particularly in healthcare. Current techniques for evaluating deep learning models cannot visualize confounding factors at a diagnostic level. Here, we propose a self-conditioned diffusion model termed DiffChest and train it on a dataset of 515,704 chest radiographs from 194,956 patients from multiple healthcare centers in the United States and Europe. DiffChest explains classifications on a patient-specific level and visualizes the confounding factors that may mislead the model. We found high inter-reader agreement when evaluating DiffChest's capability to identify treatment-related confounders, with Fleiss' Kappa values of 0.8 or higher across most imaging findings. Confounders were accurately captured with 11.1% to 100% prevalence rates. Furthermore, our pretraining process optimized the model to capture the most relevant information from the input radiographs. DiffChest achieved excellent diagnostic accuracy when diagnosing 11 chest conditions, such as pleural effusion and cardiac insufficiency, and at least sufficient diagnostic accuracy for the remaining conditions. Our findings highlight the potential of pretraining based on diffusion models in medical image classification, specifically in providing insights into confounding factors and model robustness.
PLOct 7, 2022
Novice Type Error Diagnosis with Natural Language ModelsChuqin Geng, Haolin Ye, Yixuan Li et al.
Strong static type systems help programmers eliminate many errors without much burden of supplying type annotations. However, this flexibility makes it highly non-trivial to diagnose ill-typed programs, especially for novice programmers. Compared to classic constraint solving and optimization-based approaches, the data-driven approach has shown great promise in identifying the root causes of type errors with higher accuracy. Instead of relying on hand-engineered features, this work explores natural language models for type error localization, which can be trained in an end-to-end fashion without requiring any features. We demonstrate that, for novice type error diagnosis, the language model-based approach significantly outperforms the previous state-of-the-art data-driven approach. Specifically, our model could predict type errors correctly 62% of the time, outperforming the state-of-the-art Nate's data-driven model by 11%, in a more rigorous accuracy metric. Furthermore, we also apply structural probes to explain the performance difference between different language models.
SYNov 14, 2025
Further Results on Safety-Critical Stabilization of Force-Controlled Nonholonomic Mobile RobotsBo Wang, Tianyu Han, Guangwei Wang
In this paper, we address the stabilization problem for force-controlled nonholonomic mobile robots under safety-critical constraints. We propose a continuous, time-invariant control law based on the gamma m-quadratic programming (gamma m-QP) framework, which unifies control Lyapunov functions (CLFs) and control barrier functions (CBFs) to enforce both stability and safety in the closed-loop system. For the first time, we construct a global, time-invariant, strict Lyapunov function for the closed-loop nonholonomic mobile robot full-dynamic system with a nominal stabilization controller in polar coordinates; this strict Lyapunov function then serves as the CLF in the QP design. Next, by exploiting the inherent cascaded structure of the vehicle dynamics, we develop a CBF for the mobile robot via an integrator backstepping procedure. Our main results guarantee both asymptotic stability and safety for the closed-loop system. Both the simulation and experimental results are presented to illustrate the effectiveness and performance of our approach.
CLJan 25, 2024Code
LongHealth: A Question Answering Benchmark with Long Clinical DocumentsLisa Adams, Felix Busch, Tianyu Han et al.
Background: Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data. Methods: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting, challenging LLMs to extract and interpret information from large clinical documents. Results: We evaluated nine open-source LLMs with a minimum of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1, particularly in tasks focused on information retrieval from single and multiple patient documents. However, all models struggled significantly in tasks requiring the identification of missing information, highlighting a critical area for improvement in clinical data interpretation. Conclusion: While LLMs show considerable potential for processing long clinical documents, their current accuracy levels are insufficient for reliable clinical use, especially in scenarios requiring the identification of missing information. The LongHealth benchmark provides a more realistic assessment of LLMs in a healthcare setting and highlights the need for further model refinement for safe and effective clinical application. We make the benchmark and evaluation code publicly available.
80.9SYMar 17
Eliminating Persistent Boundary Residence via Matrosov-Type Auxiliary FunctionsTianyu Han, Guangwei Wang, Bo Wang
Control barrier functions enforce safety by guaranteeing forward invariance of an admissible set. Under standard (non-strict) barrier conditions, however, forward invariance alone does not prevent trajectories from remaining on the boundary of the safe set for arbitrarily long time intervals, potentially leading to boundary sticking or deadlock phenomena. This paper studies the elimination of persistent boundary residence under forward-invariant barrier conditions. Inspired by Matrosov-type arguments, we introduce an auxiliary function framework that preserves forward invariance while excluding infinite-time residence within boundary layers. Sufficient conditions are established under which any trajectory can only remain in a prescribed neighborhood of the boundary for finite time, thereby restoring boundary-level liveness without altering forward invariance. The proposed construction does not rely on singular barrier formulations or controller-specific modifications, and can be incorporated into standard safety-critical control architectures. Numerical examples illustrate the removal of boundary sticking behaviors while maintaining safety across representative systems.
IVNov 24, 2024
Medical Slice Transformer: Improved Diagnosis and Explainability on 3D Medical Images with DINOv2Gustav Müller-Franzes, Firas Khader, Robert Siepmann et al.
MRI and CT are essential clinical cross-sectional imaging techniques for diagnosing complex conditions. However, large 3D datasets with annotations for deep learning are scarce. While methods like DINOv2 are encouraging for 2D image analysis, these methods have not been applied to 3D medical images. Furthermore, deep learning models often lack explainability due to their "black-box" nature. This study aims to extend 2D self-supervised models, specifically DINOv2, to 3D medical imaging while evaluating their potential for explainable outcomes. We introduce the Medical Slice Transformer (MST) framework to adapt 2D self-supervised models for 3D medical image analysis. MST combines a Transformer architecture with a 2D feature extractor, i.e., DINOv2. We evaluate its diagnostic performance against a 3D convolutional neural network (3D ResNet) across three clinical datasets: breast MRI (651 patients), chest CT (722 patients), and knee MRI (1199 patients). Both methods were tested for diagnosing breast cancer, predicting lung nodule dignity, and detecting meniscus tears. Diagnostic performance was assessed by calculating the Area Under the Receiver Operating Characteristic Curve (AUC). Explainability was evaluated through a radiologist's qualitative comparison of saliency maps based on slice and lesion correctness. P-values were calculated using Delong's test. MST achieved higher AUC values compared to ResNet across all three datasets: breast (0.94$\pm$0.01 vs. 0.91$\pm$0.02, P=0.02), chest (0.95$\pm$0.01 vs. 0.92$\pm$0.02, P=0.13), and knee (0.85$\pm$0.04 vs. 0.69$\pm$0.05, P=0.001). Saliency maps were consistently more precise and anatomically correct for MST than for ResNet. Self-supervised 2D models like DINOv2 can be effectively adapted for 3D medical imaging using MST, offering enhanced diagnostic accuracy and explainability compared to convolutional neural networks.
IVJun 23, 2024
On Instabilities of Unsupervised Denoising Diffusion Models in Magnetic Resonance Imaging ReconstructionTianyu Han, Sven Nebelung, Firas Khader et al.
Denoising diffusion models offer a promising approach to accelerating magnetic resonance imaging (MRI) and producing diagnostic-level images in an unsupervised manner. However, our study demonstrates that even tiny worst-case potential perturbations transferred from a surrogate model can cause these models to generate fake tissue structures that may mislead clinicians. The transferability of such worst-case perturbations indicates that the robustness of image reconstruction may be compromised due to MR system imperfections or other sources of noise. Moreover, at larger perturbation strengths, diffusion models exhibit Gaussian noise-like artifacts that are distinct from those observed in supervised models and are more challenging to detect. Our results highlight the vulnerability of current state-of-the-art diffusion-based reconstruction models to possible worst-case perturbations and underscore the need for further research to improve their robustness and reliability in clinical settings.
CVJun 3, 2024
Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence NormalizationFiras Khader, Omar S. M. El Nahhas, Tianyu Han et al.
The Transformer model has been pivotal in advancing fields such as natural language processing, speech recognition, and computer vision. However, a critical limitation of this model is its quadratic computational and memory complexity relative to the sequence length, which constrains its application to longer sequences. This is especially crucial in medical imaging where high-resolution images can reach gigapixel scale. Efforts to address this issue have predominantely focused on complex techniques, such as decomposing the softmax operation integral to the Transformer's architecture. This paper addresses this quadratic computational complexity of Transformer models and introduces a remarkably simple and effective method that circumvents this issue by eliminating the softmax function from the attention mechanism and adopting a sequence normalization technique for the key, query, and value tokens. Coupled with a reordering of matrix multiplications this approach reduces the memory- and compute complexity to a linear scale. We evaluate this approach across various medical imaging datasets comprising fundoscopic, dermascopic, radiologic and histologic imaging data. Our findings highlight that these models exhibit a comparable performance to traditional transformer models, while efficiently handling longer sequences.
IVMay 11, 2023
Transformers for CT Reconstruction From Monoplanar and Biplanar RadiographsFiras Khader, Gustav Müller-Franzes, Tianyu Han et al.
Computed Tomography (CT) scans provide detailed and accurate information of internal structures in the body. They are constructed by sending x-rays through the body from different directions and combining this information into a three-dimensional volume. Such volumes can then be used to diagnose a wide range of conditions and allow for volumetric measurements of organs. In this work, we tackle the problem of reconstructing CT images from biplanar x-rays only. X-rays are widely available and even if the CT reconstructed from these radiographs is not a replacement of a complete CT in the diagnostic setting, it might serve to spare the patients from radiation where a CT is only acquired for rough measurements such as determining organ size. We propose a novel method based on the transformer architecture, by framing the underlying task as a language translation problem. Radiographs and CT images are first embedded into latent quantized codebook vectors using two different autoencoder networks. We then train a GPT model, to reconstruct the codebook vectors of the CT image, conditioned on the codebook vectors of the x-rays and show that this approach leads to realistic looking images. To encourage further research in this direction, we make our code publicly available on GitHub: XXX.
CVMay 11, 2023
Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image Classification Using TransformersFiras Khader, Jakob Nikolas Kather, Tianyu Han et al.
Whole-Slide Imaging allows for the capturing and digitization of high-resolution images of histological specimen. An automated analysis of such images using deep learning models is therefore of high demand. The transformer architecture has been proposed as a possible candidate for effectively leveraging the high-resolution information. Here, the whole-slide image is partitioned into smaller image patches and feature tokens are extracted from these image patches. However, while the conventional transformer allows for a simultaneous processing of a large set of input tokens, the computational demand scales quadratically with the number of input tokens and thus quadratically with the number of image patches. To address this problem we propose a novel cascaded cross-attention network (CCAN) based on the cross-attention mechanism that scales linearly with the number of extracted patches. Our experiments demonstrate that this architecture is at least on-par with and even outperforms other attention-based state-of-the-art methods on two public datasets: On the use-case of lung cancer (TCGA NSCLC) our model reaches a mean area under the receiver operating characteristic (AUC) of 0.970 $\pm$ 0.008 and on renal cancer (TCGA RCC) reaches a mean AUC of 0.985 $\pm$ 0.004. Furthermore, we show that our proposed model is efficient in low-data regimes, making it a promising approach for analyzing whole-slide images in resource-limited settings. To foster research in this direction, we make our code publicly available on GitHub: XXX.
IVNov 22, 2021
Image prediction of disease progression by style-based manifold extrapolationTianyu Han, Jakob Nikolas Kather, Federico Pedersoli et al.
Disease-modifying management aims to prevent deterioration and progression of the disease, not just relieve symptoms. Unfortunately, the development of necessary therapies is often hampered by the failure to recognize the presymptomatic disease and limited understanding of disease development. We present a generic solution for this problem by a methodology that allows the prediction of progression risk and morphology in individuals using a latent extrapolation optimization approach. To this end, we combined a regularized generative adversarial network (GAN) and a latent nearest neighbor algorithm for joint optimization to generate plausible images of future time points. We evaluated our method on osteoarthritis (OA) data from a multi-center longitudinal study (the Osteoarthritis Initiative, OAI). With presymptomatic baseline data, our model is generative and significantly outperforms the end-to-end learning model in discriminating the progressive cohort. Two experiments were performed with seven experienced radiologists. When no synthetic follow-up radiographs were provided, our model performed better than all seven radiologists. In cases where the synthetic follow-ups generated by our model were available, the specificity and sensitivity of all readers in discriminating progressors increased from $72.3\%$ to $88.6\%$ and from $42.1\%$ to $51.6\%$, respectively. Our results open up a new possibility of using model-based morphology and risk prediction to make predictions about future disease occurrence, as demonstrated in the example of OA.
LGNov 25, 2020
Advancing diagnostic performance and clinical usability of neural networks via adversarial training and dual batch normalizationTianyu Han, Sven Nebelung, Federico Pedersoli et al.
Unmasking the decision-making process of machine learning models is essential for implementing diagnostic support systems in clinical practice. Here, we demonstrate that adversarially trained models can significantly enhance the usability of pathology detection as compared to their standard counterparts. We let six experienced radiologists rate the interpretability of saliency maps in datasets of X-rays, computed tomography, and magnetic resonance imaging scans. Significant improvements were found for our adversarial models, which could be further improved by the application of dual batch normalization. Contrary to previous research on adversarially trained models, we found that the accuracy of such models was equal to standard models when sufficiently large datasets and dual batch norm training were used. To ensure transferability, we additionally validated our results on an external test set of 22,433 X-rays. These findings elucidate that different paths for adversarial and real images are needed during training to achieve state of the art results with superior clinical interpretability.