IV | Aug 5, 2022
Adversarial Robustness of MR Image Reconstruction under Realistic Perturbations
Jan Nikolas Morshuis, Sergios Gatidis, Matthias Hein et al.
Deep Learning (DL) methods have shown promising results for solving ill-posed inverse problems such as MR image reconstruction from undersampled $k$-space data. However, these approaches currently have no guarantees for reconstruction quality and the reliability of such algorithms is only poorly understood. Adversarial attacks offer a valuable tool to understand possible failure modes and worst case performance of DL-based reconstruction algorithms. In this paper we describe adversarial attacks on multi-coil $k$-space measurements and evaluate them on the recently proposed E2E-VarNet and a simpler UNet-based model. In contrast to prior work, the attacks are targeted to specifically alter diagnostically relevant regions. Using two realistic attack models (adversarial $k$-space noise and adversarial rotations) we are able to show that current state-of-the-art DL-based reconstruction algorithms are indeed sensitive to such perturbations to a degree where relevant diagnostic information may be lost. Surprisingly, in our experiments the UNet and the more sophisticated E2E-VarNet were similarly sensitive to such attacks. Our findings add further to the evidence that caution must be exercised as DL-based methods move closer to clinical practice.
IV | Jan 9, 2023
Multiscale Metamorphic VAE for 3D Brain MRI Synthesis
Jaivardhan Kapoor, Jakob H. Macke, Christian F. Baumgartner
Generative modeling of 3D brain MRIs presents difficulties in achieving high visual fidelity while ensuring sufficient coverage of the data distribution. In this work, we propose to address this challenge with composable, multiscale morphological transformations in a variational autoencoder (VAE) framework. These transformations are applied to a chosen reference brain image to generate MRI volumes, equipping the model with strong anatomical inductive biases. We structure the VAE latent space in a way such that the model covers the data distribution sufficiently well. We show substantial performance improvements in FID while retaining comparable, or superior, reconstruction quality compared to prior work based on VAEs and generative adversarial networks (GANs).
LG | Mar 8, 2023
Deep Hypothesis Tests Detect Clinically Relevant Subgroup Shifts in Medical Images
Lisa M. Koch, Christian M. Schürch, Christian F. Baumgartner et al.
Distribution shifts remain a fundamental problem for the safe application of machine learning systems. If undetected, they may impact the real-world performance of such systems or will at least render original performance claims invalid. In this paper, we focus on the detection of subgroup shifts, a type of distribution shift that can occur when subgroups have a different prevalence during validation compared to the deployment setting. For example, algorithms developed on data from various acquisition settings may be predominantly applied in hospitals with lower quality data acquisition, leading to an inadvertent performance drop. We formulate subgroup shift detection in the framework of statistical hypothesis testing and show that recent state-of-the-art statistical tests can be effectively applied to subgroup shift detection on medical imaging data. We provide synthetic experiments as well as extensive evaluation on clinically meaningful subgroup shifts on histopathology as well as retinal fundus images. We conclude that classifier-based subgroup shift detection tests could be a particularly useful tool for post-market surveillance of deployed ML systems.
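The idea of a classifier-based shift test can be illustrated with a small classifier two-sample test. This is a minimal sketch under stated assumptions, not the tests evaluated in the paper: the nearest-centroid classifier stands in for a learned deep classifier, inputs are assumed to be feature vectors rather than images, and the function name `c2st_pvalue` is hypothetical. If a classifier can tell validation data from deployment data better than chance, the two distributions differ.

```python
import math
import numpy as np

def c2st_pvalue(source, target, seed=0):
    """Classifier two-sample test with a nearest-centroid classifier (illustrative stand-in).

    source, target: arrays of shape (n_samples, n_features).
    Returns (held-out accuracy, one-sided p-value under H0: accuracy = 0.5).
    """
    rng = np.random.default_rng(seed)
    x = np.concatenate([source, target])
    y = np.concatenate([np.zeros(len(source), int), np.ones(len(target), int)])

    # Shuffle, then split into a training half and a held-out half.
    idx = rng.permutation(len(x))
    x, y = x[idx], y[idx]
    half = len(x) // 2
    x_tr, y_tr, x_te, y_te = x[:half], y[:half], x[half:], y[half:]

    # "Train": one centroid per domain label.
    c0 = x_tr[y_tr == 0].mean(axis=0)
    c1 = x_tr[y_tr == 1].mean(axis=0)

    # Predict the domain whose centroid is closer.
    d0 = np.linalg.norm(x_te - c0, axis=1)
    d1 = np.linalg.norm(x_te - c1, axis=1)
    pred = (d1 < d0).astype(int)

    # Exact binomial tail: P(#correct >= observed) if the classifier were guessing.
    correct = int((pred == y_te).sum())
    n = len(y_te)
    p_value = sum(math.comb(n, k) for k in range(correct, n + 1)) / 2 ** n
    return correct / n, p_value
```

A small p-value suggests that the deployment data differ from the validation data; the choice of significance level is left to the user.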
LG | Jul 23, 2023
Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations?
Susu Sun, Lisa M. Koch, Christian F. Baumgartner
While deep neural network models offer unmatched classification performance, they are prone to learning spurious correlations in the data. Such dependencies on confounding information can be difficult to detect using performance metrics if the test data comes from the same distribution as the training data. Interpretable ML methods such as post-hoc explanations or inherently interpretable classifiers promise to identify faulty model reasoning. However, there is mixed evidence whether many of these techniques are actually able to do so. In this paper, we propose a rigorous evaluation strategy to assess an explanation technique's ability to correctly identify spurious correlations. Using this strategy, we evaluate five post-hoc explanation techniques and one inherently interpretable method for their ability to detect three types of artificially added confounders in a chest x-ray diagnosis task. We find that the post-hoc technique SHAP, as well as the inherently interpretable Attri-Net provide the best performance and can be used to reliably identify faulty model behavior.
CV | Mar 1, 2023
Inherently Interpretable Multi-Label Classification Using Class-Specific Counterfactuals
Susu Sun, Stefano Woerner, Andreas Maier et al.
Interpretability is essential for machine learning algorithms in high-stakes application fields such as medical image analysis. However, high-performing black-box neural networks do not provide explanations for their predictions, which can lead to mistrust and suboptimal human-ML collaboration. Post-hoc explanation techniques, which are widely used in practice, have been shown to suffer from severe conceptual problems. Furthermore, as we show in this paper, current explanation techniques do not perform adequately in the multi-label scenario, in which multiple medical findings may co-occur in a single image. We propose Attri-Net, an inherently interpretable model for multi-label classification. Attri-Net is a powerful classifier that provides transparent, trustworthy, and human-understandable explanations. The model first generates class-specific attribution maps based on counterfactuals to identify which image regions correspond to certain medical findings. Then a simple logistic regression classifier is used to make predictions based solely on these attribution maps. We compare Attri-Net to five post-hoc explanation techniques and one inherently interpretable classifier on three chest X-ray datasets. We find that Attri-Net produces high-quality multi-label explanations consistent with clinical knowledge and has comparable classification performance to state-of-the-art classification models.
IV | Aug 4, 2023
Uncertainty Estimation and Propagation in Accelerated MRI Reconstruction
Paul Fischer, Thomas Küstner, Christian F. Baumgartner
MRI reconstruction techniques based on deep learning have led to unprecedented reconstruction quality, especially in highly accelerated settings. However, deep learning techniques are also known to fail unexpectedly and hallucinate structures. This is particularly problematic if reconstructions are directly used for downstream tasks such as real-time treatment guidance or automated extraction of clinical parameters (e.g. via segmentation). Well-calibrated uncertainty quantification will be a key ingredient for safe use of this technology in clinical practice. In this paper we propose a novel probabilistic reconstruction technique (PHiRec) building on the idea of conditional hierarchical variational autoencoders. We demonstrate that our proposed method produces high-quality reconstructions as well as uncertainty quantification that is substantially better calibrated than several strong baselines. We furthermore demonstrate how uncertainties arising in the MR reconstruction can be propagated to a downstream segmentation task, and show that PHiRec also allows well-calibrated estimation of segmentation uncertainties that originated in the MR reconstruction process.
IV | Jul 25, 2024 | Code
Segmentation-guided MRI reconstruction for meaningfully diverse reconstructions
Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner
Inverse problems, such as accelerated MRI reconstruction, are ill-posed, and infinitely many plausible solutions exist. This may not only lead to uncertainty in the reconstructed image but also in downstream tasks such as semantic segmentation. This uncertainty, however, is mostly not analyzed in the literature, even though probabilistic reconstruction models are commonly used. These models can be prone to ignore plausible but unlikely solutions like rare pathologies. Building on MRI reconstruction approaches based on diffusion models, we add guidance to the diffusion process during inference, generating two meaningfully diverse reconstructions corresponding to an upper and lower bound segmentation. The reconstruction uncertainty can then be quantified by the difference between these bounds, which we coin the 'uncertainty boundary'. We analyzed the behavior of the upper and lower bound segmentations for a wide range of acceleration factors and found the uncertainty boundary to be both more reliable and more accurate compared to repeated sampling. Code is available at https://github.com/NikolasMorshuis/SGR
LG | Jul 11, 2024
Subgroup-Specific Risk-Controlled Dose Estimation in Radiotherapy
Paul Fischer, Hannah Willms, Moritz Schneider et al.
Cancer remains a leading cause of death, highlighting the importance of effective radiotherapy (RT). Magnetic resonance-guided linear accelerators (MR-Linacs) enable imaging during RT, allowing for inter-fraction, and perhaps even intra-fraction, adjustments of treatment plans. However, achieving this requires fast and accurate dose calculations. While Monte Carlo simulations offer accuracy, they are computationally intensive. Deep learning frameworks show promise, yet lack uncertainty quantification crucial for high-risk applications like RT. Risk-controlling prediction sets (RCPS) offer model-agnostic uncertainty quantification with mathematical guarantees. However, we show that naive application of RCPS may lead to only certain subgroups such as the image background being risk-controlled. In this work, we extend RCPS to provide prediction intervals with coverage guarantees for multiple subgroups with unknown subgroup membership at test time. We evaluate our algorithm on real clinical planning volumes from five different anatomical regions and show that our novel subgroup RCPS (SG-RCPS) algorithm leads to prediction intervals that jointly control the risk for multiple subgroups. In particular, our method controls the risk of the crucial voxels along the radiation beam significantly better than conventional RCPS.
IV | Jul 18, 2024
Conformal Performance Range Prediction for Segmentation Output Quality Control
Anna M. Wundram, Paul Fischer, Michael Muehlebach et al.
Recent works have introduced methods to estimate segmentation performance without ground truth, relying solely on neural network softmax outputs. These techniques hold potential for intuitive output quality control. However, such performance estimates rely on calibrated softmax outputs, which is often not the case in modern neural networks. Moreover, the estimates do not take into account inherent uncertainty in segmentation tasks. These limitations may render precise performance predictions unattainable, restricting the practical applicability of performance estimation methods. To address these challenges, we develop a novel approach for predicting performance ranges with statistical guarantees of containing the ground truth with a user specified probability. Our method leverages sampling-based segmentation uncertainty estimation to derive heuristic performance ranges, and applies split conformal prediction to transform these estimates into rigorous prediction ranges that meet the desired guarantees. We demonstrate our approach on the FIVES retinal vessel segmentation dataset and compare five commonly used sampling-based uncertainty estimation techniques. Our results show that it is possible to achieve the desired coverage with small prediction ranges, highlighting the potential of performance range prediction as a valuable tool for output quality control.
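The split conformal step that turns heuristic ranges into ranges with a coverage guarantee can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the heuristic range for each calibration image is given as a pair (lo, hi) of a metric in [0, 1] such as Dice, that the true metric value y is known on the calibration set, and it uses the interval conformity score max(lo - y, y - hi) familiar from conformalized quantile regression. The function names are hypothetical.

```python
import numpy as np

def conformal_margin(lo, hi, y, alpha):
    """Margin q so that [lo - q, hi + q] covers the true value with
    probability >= 1 - alpha, assuming exchangeable calibration and test data."""
    lo, hi, y = (np.asarray(a, dtype=float) for a in (lo, hi, y))
    # Positive score: y lies outside the heuristic range by that amount.
    # Negative score: y lies inside, with that much slack.
    scores = np.maximum(lo - y, y - hi)
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration samples for this alpha
    return np.sort(scores)[k - 1]

def conformalize(lo, hi, q):
    """Widen (or, for negative q, shrink) test-time ranges and clip to [0, 1]."""
    lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
    return np.clip(lo - q, 0.0, 1.0), np.clip(hi + q, 0.0, 1.0)
```

The margin is computed once on held-out calibration data and then applied unchanged to every test-time range, which is what makes the guarantee hold without retraining the segmentation model.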
CV | Feb 6, 2023
Studying Therapy Effects and Disease Outcomes in Silico using Artificial Counterfactual Tissue Samples
Martin Paulikat, Christian M. Schürch, Christian F. Baumgartner
Understanding the interactions of different cell types inside the immune tumor microenvironment (iTME) is crucial for the development of immunotherapy treatments as well as for predicting their outcomes. Highly multiplexed tissue imaging (HMTI) technologies offer a tool which can capture cell properties of tissue samples by measuring expression of various proteins and storing them in separate image channels. HMTI technologies can be used to gain insights into the iTME and in particular how the iTME differs for different patient outcome groups of interest (e.g., treatment responders vs. non-responders). Understanding the systematic differences in the iTME of different patient outcome groups is crucial for developing better treatments and personalising existing treatments. However, such analyses are inherently limited by the fact that any two tissue samples vary due to a large number of factors unrelated to the outcome. Here, we present CF-HistoGAN, a machine learning framework that employs generative adversarial networks (GANs) to create artificial counterfactual tissue samples that resemble the original tissue samples as closely as possible but capture the characteristics of a different patient outcome group. Specifically, we learn to "translate" HMTI samples from one patient group to create artificial paired samples. We show that this approach allows one to directly study the effects of different patient outcomes on the iTMEs of individual tissue samples. We demonstrate that CF-HistoGAN can be employed as an explorative tool for understanding iTME effects on the pixel level. Moreover, we show that our method can be used to identify statistically significant differences in the expression of different proteins between patient groups with greater sensitivity compared to conventional approaches.
CV | Aug 15, 2024
Navigating Data Scarcity using Foundation Models: A Benchmark of Few-Shot and Zero-Shot Learning Approaches in Medical Imaging
Stefano Woerner, Christian F. Baumgartner
Data scarcity is a major limiting factor for applying modern machine learning techniques to clinical tasks. Although sufficient data exists for some well-studied medical tasks, there remains a long tail of clinically relevant tasks with poor data availability. Recently, numerous foundation models have demonstrated high suitability for few-shot learning (FSL) and zero-shot learning (ZSL), potentially making them more accessible to practitioners. However, it remains unclear which foundation model performs best on FSL medical image analysis tasks and what the optimal methods are for learning from limited data. We conducted a comprehensive benchmark study of ZSL and FSL using 16 pretrained foundation models on 19 diverse medical imaging datasets. Our results indicate that BiomedCLIP, a model pretrained exclusively on medical data, performs best on average for very small training set sizes, while very large CLIP models pretrained on LAION-2B perform best with slightly more training samples. However, simply fine-tuning a ResNet-18 pretrained on ImageNet performs similarly with more than five training examples per class. Our findings also highlight the need for further research on foundation models specifically tailored for medical applications and the collection of more datasets to train these models.
IV | Jul 1, 2025 | Code
Mind the Detail: Uncovering Clinically Relevant Image Details in Accelerated MRI with Semantically Diverse Reconstructions
Jan Nikolas Morshuis, Christian Schlarmann, Thomas Küstner et al.
In recent years, accelerated MRI reconstruction based on deep learning has led to significant improvements in image quality with impressive results for high acceleration factors. However, from a clinical perspective image quality is only secondary; much more important is that all clinically relevant information is preserved in the reconstruction from heavily undersampled data. In this paper, we show that existing techniques, even when considering resampling for diffusion-based reconstruction, can fail to reconstruct small and rare pathologies, thus leading to potentially wrong diagnosis decisions (false negatives). To uncover the potentially missing clinical information we propose "Semantically Diverse Reconstructions" (SDR), a method which, given an original reconstruction, generates novel reconstructions with enhanced semantic variability while all of them are fully consistent with the measured data. To evaluate SDR automatically we train an object detector on the fastMRI+ dataset. We show that SDR significantly reduces the chance of false-negative diagnoses (higher recall) and improves mean average precision compared to the original reconstructions. The code is available at https://github.com/NikolasMorshuis/SDR
CV | Mar 11, 2025 | Code
Prototype-Based Multiple Instance Learning for Gigapixel Whole Slide Image Classification
Susu Sun, Dominique van Midden, Geert Litjens et al.
Multiple Instance Learning (MIL) methods have succeeded remarkably in histopathology whole slide image (WSI) analysis. However, most MIL models only offer attention-based explanations that do not faithfully capture the model's decision mechanism and do not allow human-model interaction. To address these limitations, we introduce ProtoMIL, an inherently interpretable MIL model for WSI analysis that offers user-friendly explanations and supports human intervention. Our approach employs a sparse autoencoder to discover human-interpretable concepts from the image feature space, which are then used to train ProtoMIL. The model represents predictions as linear combinations of concepts, making the decision process transparent. Furthermore, ProtoMIL allows users to perform model interventions by altering the input concepts. Experiments on two widely used pathology datasets demonstrate that ProtoMIL achieves a classification performance comparable to state-of-the-art MIL models while offering intuitively understandable explanations. Moreover, we demonstrate that our method can eliminate reliance on diagnostically irrelevant information via human intervention, guiding the model toward being right for the right reason. Code will be publicly available at https://github.com/ss-sun/ProtoMIL.
CV | Mar 19
Rethinking Uncertainty Quantification and Entanglement in Image Segmentation
Jakob Lønborg Christensen, Vedrana Andersen Dahl, Morten Rieger Hannemose et al.
Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.
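One common way to split total uncertainty into aleatoric and epistemic parts is the entropy-based decomposition: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the individual predictions, and epistemic uncertainty is their difference (the mutual information). The sketch below illustrates this for an ensemble or a set of samples. It is an assumption that this particular decomposition matches the one used in the paper, and the function name is hypothetical.

```python
import numpy as np

def decompose_uncertainty(probs, eps=1e-12):
    """Entropy-based uncertainty decomposition.

    probs: array of shape (members, ..., classes) holding the softmax outputs
           of M ensemble members or stochastic samples.
    Returns (total, aleatoric, epistemic), each of shape (...).
    """
    probs = np.asarray(probs, dtype=float)
    mean_p = probs.mean(axis=0)

    # Total uncertainty: entropy of the averaged prediction.
    total = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)

    # Aleatoric uncertainty: mean entropy of the individual predictions.
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)

    # Epistemic uncertainty: what is left, i.e. disagreement between members.
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```

For segmentation, `probs` would have shape (members, height, width, classes) and each returned map has one value per pixel. Members that agree on an ambiguous pixel yield high aleatoric and near-zero epistemic uncertainty; confident members that disagree yield the opposite.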
CV | Jul 15, 2024
PULPo: Probabilistic Unsupervised Laplacian Pyramid Registration
Leonard Siegert, Paul Fischer, Mattias P. Heinrich et al.
Deformable image registration is fundamental to many medical imaging applications. Registration is an inherently ambiguous task often admitting many viable solutions. While neural network-based registration techniques enable fast and accurate registration, the majority of existing approaches are not able to estimate uncertainty. Here, we present PULPo, a method for probabilistic deformable registration capable of uncertainty quantification. PULPo probabilistically models the distribution of deformation fields on different hierarchical levels combining them using Laplacian pyramids. This allows our method to model global as well as local aspects of the deformation field. We evaluate our method on two widely used neuroimaging datasets and find that it achieves high registration performance as well as substantially better calibrated uncertainty quantification compared to the current state-of-the-art.
IV | Jul 9, 2020 | Code
Semi-supervised Task-driven Data Augmentation for Medical Image Segmentation
Krishna Chaitanya, Neerav Karani, Christian F. Baumgartner et al.
Supervised learning-based segmentation methods typically require a large amount of annotated training data to generalize well at test time. In medical applications, curating such datasets is not a favourable option because acquiring a large number of annotated samples from experts is time-consuming and expensive. Consequently, numerous methods have been proposed in the literature for learning with limited annotated examples. Unfortunately, the proposed approaches in the literature have not yet yielded significant gains over random data augmentation for image segmentation, where random augmentations themselves do not yield high accuracy. In this work, we propose a novel task-driven data augmentation method for learning with limited labeled data where the synthetic data generator is optimized for the segmentation task. The generator of the proposed method models intensity and shape variations using two sets of transformations: additive intensity transformations and deformation fields. Both transformations are optimized using labeled as well as unlabeled examples in a semi-supervised framework. Our experiments on three medical datasets, namely cardiac, prostate and pancreas, show that the proposed approach significantly outperforms standard augmentation and semi-supervised approaches for image segmentation in the limited annotation setting. The code is made publicly available at https://github.com/krishnabits001/task_driven_data_augmentation.
LG | Feb 16
Universal Algorithm-Implicit Learning
Stefano Woerner, Seong Joon Oh, Christian F. Baumgartner
Current meta-learning methods are constrained to narrow task distributions with fixed feature and label spaces, limiting applicability. Moreover, the current meta-learning literature uses key terms like "universal" and "general-purpose" inconsistently and lacks precise definitions, hindering comparability. We introduce a theoretical framework for meta-learning which formally defines practical universality and introduces a distinction between algorithm-explicit and algorithm-implicit learning, providing a principled vocabulary for reasoning about universal meta-learning methods. Guided by this framework, we present TAIL, a transformer-based algorithm-implicit meta-learner that functions across tasks with varying domains, modalities, and label configurations. TAIL features three innovations over prior transformer-based meta-learners: random projections for cross-modal feature encoding, random injection label embeddings that extrapolate to larger label spaces, and efficient inline query processing. TAIL achieves state-of-the-art performance on standard few-shot benchmarks while generalizing to unseen domains. Unlike other meta-learning methods, it also generalizes to unseen modalities, solving text classification tasks despite training exclusively on images, handles tasks with up to 20$\times$ more classes than seen during training, and provides orders-of-magnitude computational savings over prior transformer-based approaches.
CV | Dec 1, 2025
Deep Unsupervised Anomaly Detection in Brain Imaging: Large-Scale Benchmarking and Bias Analysis
Alexander Frotscher, Christian F. Baumgartner, Thomas Wolfers
Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1 and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, systematic biases, such as scanner-related effects, were observed for the majority of algorithms, including that small and low-contrast lesions were missed more often, and that false positives varied with age and sex. Increasing healthy training data yields only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
CV | Apr 24, 2024
A comprehensive and easy-to-use multi-domain multi-task medical imaging meta-dataset
Stefano Woerner, Arthur Jaques, Christian F. Baumgartner
While the field of medical image analysis has undergone a transformative shift with the integration of machine learning techniques, the main challenge of these techniques is often the scarcity of large, diverse, and well-annotated datasets. Medical images vary in format, size, and other parameters and therefore require extensive preprocessing and standardization for use in machine learning. Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, all of which are standardized to the same format and readily usable in PyTorch or other ML frameworks. We perform a technical validation of MedIMeta, demonstrating its utility through fully supervised and cross-domain few-shot learning baselines.
CV | Jan 6, 2025
Label-free Concept Based Multiple Instance Learning for Gigapixel Histopathology
Susu Sun, Leslie Tessier, Frédérique Meeuwsen et al.
Multiple Instance Learning (MIL) methods allow for gigapixel Whole-Slide Image (WSI) analysis with only slide-level annotations. Interpretability is crucial for safely deploying such algorithms in high-stakes medical domains. Traditional MIL methods offer explanations by highlighting salient regions. However, such spatial heatmaps provide limited insights for end users. To address this, we propose a novel inherently interpretable WSI-classification approach that uses human-understandable pathology concepts to generate explanations. Our proposed Concept MIL model leverages recent advances in vision-language models to directly predict pathology concepts based on image features. The model's predictions are obtained through a linear combination of the concepts identified on the top-K patches of a WSI, enabling inherent explanations by tracing each concept's influence on the prediction. In contrast to traditional concept-based interpretable models, our approach eliminates the need for costly human annotations by leveraging the vision-language model. We validate our method on two widely used pathology datasets: Camelyon16 and PANDA. On both datasets, Concept MIL achieves AUC and accuracy scores over 0.9, putting it on par with state-of-the-art models. We further find that 87.1% (Camelyon16) and 85.3% (PANDA) of the top 20 patches fall within the tumor region. A user study shows that the concepts identified by our model align with the concepts used by pathologists, making it a promising strategy for human-interpretable WSI classification.
CV | Dec 4, 2023
Unsupervised Anomaly Detection using Aggregated Normative Diffusion
Alexander Frotscher, Jaivardhan Kapoor, Thomas Wolfers et al.
Early detection of anomalies in medical images such as brain MRI is highly relevant for diagnosis and treatment of many conditions. Supervised machine learning methods are limited to a small number of pathologies where there is good availability of labeled data. In contrast, unsupervised anomaly detection (UAD) has the potential to identify a broader spectrum of anomalies by spotting deviations from normal patterns. Our research demonstrates that existing state-of-the-art UAD approaches do not generalise well to diverse types of anomalies in realistic multi-modal MR data. To overcome this, we introduce a new UAD method named Aggregated Normative Diffusion (ANDi). ANDi operates by aggregating differences between predicted denoising steps and ground truth backwards transitions in Denoising Diffusion Probabilistic Models (DDPMs) that have been trained on pyramidal Gaussian noise. We validate ANDi against three recent UAD baselines, and across three diverse brain MRI datasets. We show that ANDi, in some cases, substantially surpasses these baselines and shows increased robustness to varying types of anomalies. Particularly in detecting multiple sclerosis (MS) lesions, ANDi achieves improvements of up to 178% in terms of AUPRC.
IV | Aug 26, 2025
Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data
Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner
MR imaging is a valuable diagnostic tool that allows non-invasive visualization of patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing one-stage approaches, which combine reconstruction and segmentation into a unified model, with two-stage approaches, which utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.
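Data consistency, mentioned above as a feature of the best-performing two-stage methods, can be illustrated with a hard data-consistency step: wherever k-space was actually measured, the reconstruction's k-space values are overwritten with the measurements. The sketch below is a simplification under stated assumptions (single-coil, Cartesian sampling, a boolean sampling mask, orthonormal FFT) and is not the benchmark's code; the benchmark uses multi-coil data, which additionally requires coil sensitivity maps. The function name is hypothetical.

```python
import numpy as np

def data_consistency(image, measured_kspace, mask):
    """Enforce agreement between a reconstruction and the measured k-space.

    image:           2D reconstruction (real or complex), e.g. a network output.
    measured_kspace: 2D complex array of acquired k-space samples.
    mask:            2D boolean array, True where k-space was sampled.
    Returns the complex image after hard data consistency.
    """
    # Move the current estimate to k-space.
    k_pred = np.fft.fft2(image, norm="ortho")
    # Keep measurements where available, predictions elsewhere.
    k_dc = np.where(mask, measured_kspace, k_pred)
    # Back to image space.
    return np.fft.ifft2(k_dc, norm="ortho")
```

After this step the reconstruction can differ from the measurements only at unsampled k-space locations, so a subsequent segmentation network sees an image that does not contradict the acquired data.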
CVAug 4, 2025
Is Uncertainty Quantification a Viable Alternative to Learned Deferral?Anna M. Wundram, Christian F. Baumgartner
Artificial Intelligence (AI) holds the potential to dramatically improve patient care. However, it is not infallible, necessitating human-AI collaboration to ensure safe implementation. One aspect of AI safety is the models' ability to defer decisions to a human expert when they are likely to misclassify autonomously. Recent research has focused on methods that learn to defer by optimising a surrogate loss function that finds the optimal trade-off between predicting a class label or deferring. However, during clinical translation, models often face challenges such as data shift. Uncertainty quantification methods aim to estimate a model's confidence in its predictions. They may also be used as a deferral strategy that does not rely on learning from a specific training distribution. We hypothesise that models developed to quantify uncertainty are more robust to out-of-distribution (OOD) input than learned deferral models that have been trained in a supervised fashion. To investigate this hypothesis, we constructed an extensive evaluation study on a large ophthalmology dataset, examining both learned deferral models and established uncertainty quantification methods, assessing their performance in- and out-of-distribution. Specifically, we evaluate their ability to accurately classify glaucoma from fundus images while deferring cases with a high likelihood of error. We find that uncertainty quantification methods may be a promising choice for AI deferral.
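To make the idea of deferral via uncertainty concrete, here is a minimal sketch, assuming predictive entropy of softmax outputs as the uncertainty score and a fixed threshold. The function names, the threshold value, and the example probabilities are hypothetical and are not taken from the paper, which evaluates several established uncertainty quantification methods.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy (in nats) of each row of class probabilities."""
    probs = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(probs * np.log(probs)).sum(axis=1)

def defer_by_uncertainty(probs, threshold):
    """Return a boolean mask: True where the case is handed to a human expert."""
    return predictive_entropy(probs) > threshold

# Hypothetical two-class outputs (e.g. glaucoma vs. no glaucoma) for three cases.
probs = np.array([
    [0.98, 0.02],  # confident
    [0.55, 0.45],  # ambiguous
    [0.10, 0.90],  # fairly confident
])
deferred = defer_by_uncertainty(probs, threshold=0.5)  # threshold is an assumption
```

In this sketch only the ambiguous second case exceeds the threshold. In practice the threshold would be chosen on validation data to meet a target deferral rate.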
LGMar 13, 2025
Subgroup Performance Analysis in Hidden StratificationsAlceu Bissoto, Trung-Dung Hoang, Tim Flühmann et al.
Machine learning (ML) models may suffer from significant performance disparities between patient groups. Identifying such disparities by monitoring performance at a granular level is crucial for safely deploying ML to each patient. Traditional subgroup analysis based on metadata can expose performance disparities only if the available metadata (e.g., patient sex) sufficiently reflects the main reasons for performance variability, which is not common. Subgroup discovery techniques that identify cohesive subgroups based on learned feature representations appear as a potential solution: They could expose hidden stratifications and provide more granular subgroup performance reports. However, subgroup discovery is challenging to evaluate even as a standalone task, as ground truth stratification labels do not exist in real data. Subgroup discovery has thus neither been applied nor evaluated for the application of subgroup performance monitoring. Here, we apply subgroup discovery for performance monitoring in chest x-ray and skin lesion classification. We propose novel evaluation strategies and show that a simplified subgroup discovery method without access to classification labels or metadata can expose larger performance disparities than traditional metadata-based subgroup analysis. We provide the first compelling evidence that subgroup discovery can serve as an important tool for comprehensive performance validation and monitoring of trustworthy AI in medicine.
CVJun 8, 2024
Attri-Net: A Globally and Locally Inherently Interpretable Model for Multi-Label Classification Using Class-Specific CounterfactualsSusu Sun, Stefano Woerner, Andreas Maier et al.
Interpretability is crucial for machine learning algorithms in high-stakes medical applications. However, high-performing neural networks typically cannot explain their predictions. Post-hoc explanation methods provide a way to understand neural networks but have been shown to suffer from conceptual problems. Moreover, current research largely focuses on providing local explanations for individual samples rather than global explanations for the model itself. In this paper, we propose Attri-Net, an inherently interpretable model for multi-label classification that provides local and global explanations. Attri-Net first counterfactually generates class-specific attribution maps to highlight the disease evidence, then performs classification with logistic regression classifiers based solely on the attribution maps. Local explanations for each prediction can be obtained by interpreting the attribution maps weighted by the classifiers' weights. A global explanation of the whole model can be obtained by jointly considering learned average representations of the attribution maps for each class (called the class centers) and the weights of the linear classifiers. To ensure the model is "right for the right reason", we further introduce a mechanism to guide the model's explanations to align with human knowledge. Our comprehensive evaluations show that Attri-Net can generate high-quality explanations consistent with clinical knowledge while not sacrificing classification performance.
IVSep 30, 2020
Sampling possible reconstructions of undersampled acquisitions in MR imagingKerem C. Tezcan, Neerav Karani, Christian F. Baumgartner et al.
Undersampling the k-space during MR acquisitions saves time but results in an ill-posed inversion problem, leading to an infinite set of images as possible solutions. Traditionally, this is tackled as a reconstruction problem by searching for a single "best" image out of this solution set according to some chosen regularization or prior. This approach, however, misses the possibility of other solutions and hence ignores the uncertainty in the inversion process. In this paper, we propose a method that instead returns multiple images which are possible under the acquisition model and the chosen prior to capture the uncertainty in the inversion process. To this end, we introduce a low dimensional latent space and model the posterior distribution of the latent vectors given the acquisition data in k-space, from which we can sample in the latent space and obtain the corresponding images. We use a variational autoencoder for the latent model and the Metropolis adjusted Langevin algorithm for the sampling. We evaluate our method on two datasets: images from the Human Connectome Project and in-house measured multi-coil images. We compare to five alternative methods. Results indicate that the proposed method produces images that match the measured k-space data better than the alternatives, while showing realistic structural variability. Furthermore, in contrast to the compared methods, the proposed method yields higher uncertainty in the undersampled phase encoding direction, as expected. Keywords: Magnetic Resonance image reconstruction, uncertainty estimation, inverse problems, sampling, MCMC, deep learning, unsupervised learning.
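The ill-posedness described above can be seen in a toy single-coil example. This is a minimal sketch with a random array standing in for an image and a regular line-skipping mask; the mask pattern, array sizes, and variable names are assumptions for illustration and do not reflect the paper's acquisition model or sampling method.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))  # stand-in for a fully sampled image
kspace = np.fft.fft2(image)

# Keep every 4th line along axis 0 (assumed here to be the phase-encoding direction).
mask = np.zeros((16, 16), dtype=bool)
mask[::4, :] = True
measured = np.where(mask, kspace, 0)

# Candidate 1: zero-filled reconstruction.
zero_filled = np.fft.ifft2(measured)

# Candidate 2: fill the unmeasured k-space locations with arbitrary values.
arbitrary = rng.normal(size=(16, 16)) + 1j * rng.normal(size=(16, 16))
alternative = np.fft.ifft2(np.where(mask, kspace, arbitrary))

# Both candidates reproduce the measured samples, yet they are different images.
consistent_1 = np.allclose(np.fft.fft2(zero_filled)[mask], kspace[mask])
consistent_2 = np.allclose(np.fft.fft2(alternative)[mask], kspace[mask])
images_differ = not np.allclose(zero_filled, alternative)
```

Any choice of values at the unmeasured locations yields a data-consistent image, which is why a prior is needed to select or sample plausible solutions.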
IVJan 31, 2020
Automated quantification of myocardial tissue characteristics from native T1 mapping using neural networks with Bayesian inference for uncertainty-based quality-controlEsther Puyol Anton, Bram Ruijsink, Christian F. Baumgartner et al.
Tissue characterisation with CMR parametric mapping has the potential to detect and quantify both focal and diffuse alterations in myocardial structure not assessable by late gadolinium enhancement. Native T1 mapping in particular has shown promise as a useful biomarker to support diagnostic, therapeutic and prognostic decision-making in ischaemic and non-ischaemic cardiomyopathies. Convolutional neural networks with Bayesian inference are a category of artificial neural networks which model the uncertainty of the network output. This study presents an automated framework for tissue characterisation from native ShMOLLI T1 mapping at 1.5T using a Probabilistic Hierarchical Segmentation (PHiSeg) network. In addition, we use the uncertainty information provided by the PHiSeg network in a novel automated quality control (QC) step to identify uncertain T1 values. The PHiSeg network and QC were validated against manual analysis on a cohort of the UK Biobank containing healthy subjects and chronic cardiomyopathy patients. We used the proposed method to obtain reference T1 ranges for the left ventricular myocardium in healthy subjects as well as common clinical cardiac conditions. T1 values computed from automatic and manual segmentations were highly correlated (r=0.97). Bland-Altman analysis showed good agreement between the automated and manual measurements. The average Dice metric was 0.84 for the left ventricular myocardium. The sensitivity of detection of erroneous outputs was 91%. Finally, T1 values were automatically derived from 14,683 CMR exams from the UK Biobank. The proposed pipeline allows for automatic analysis of myocardial native T1 mapping and includes a QC process to detect potentially erroneous results. T1 reference values were presented for healthy subjects and common clinical cardiac conditions from the largest cohort to date using T1-mapping images.
CVJun 14, 2019
A Partially Reversible U-Net for Memory-Efficient Volumetric Image SegmentationRobin Brügger, Christian F. Baumgartner, Ender Konukoglu
One of the key drawbacks of 3D convolutional neural networks for segmentation is their memory footprint, which necessitates compromises in the network architecture in order to fit into a given memory budget. Motivated by the RevNet for image classification, we propose a partially reversible U-Net architecture that reduces memory consumption substantially. The reversible architecture allows us to exactly recover each layer's outputs from the subsequent layer's ones, eliminating the need to store activations for backpropagation. This alleviates the biggest memory bottleneck and enables very deep (theoretically infinitely deep) 3D architectures. On the BraTS challenge dataset, we demonstrate substantial memory savings. We further show that the freed memory can be used for processing the whole field-of-view (FOV) instead of patches. Increasing network depth led to higher segmentation accuracy while growing the memory footprint only by a very small fraction, thanks to the partially reversible architecture.
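The property that a layer's inputs can be recovered from its outputs can be illustrated with the additive coupling used in RevNet-style blocks. The following is a minimal NumPy sketch; `f` and `g` are hypothetical stand-ins for learned sub-networks, and this is not the authors' partially reversible U-Net.

```python
import numpy as np

def f(x):
    # Hypothetical stand-in for a learned residual function F
    return np.tanh(x)

def g(x):
    # Hypothetical stand-in for a learned residual function G
    return 0.5 * x

def reversible_forward(x1, x2):
    """Additive coupling: the input channels are split into two halves."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    """Recover the inputs from the outputs, so activations need not be stored."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))
x2 = rng.normal(size=(4, 8))
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
```

In floating-point arithmetic the recovery is exact only up to rounding error, so the check below uses a tolerance.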
IVJun 7, 2019
PHiSeg: Capturing Uncertainty in Medical Image SegmentationChristian F. Baumgartner, Kerem C. Tezcan, Krishna Chaitanya et al.
Segmentation of anatomical structures and pathologies is inherently ambiguous. For instance, structure borders may not be clearly visible or different experts may have different styles of annotating. The majority of current state-of-the-art methods do not account for such ambiguities but rather learn a single mapping from image to segmentation. In this work, we propose a novel method to model the conditional probability distribution of the segmentations given an input image. We derive a hierarchical probabilistic model, in which separate latent variables are responsible for modelling the segmentation at different resolutions. Inference in this model can be efficiently performed using the variational autoencoder framework. We show that our proposed method can be used to generate significantly more realistic and diverse segmentation samples compared to recent related work, both when trained with annotations from a single annotator and when trained with annotations from multiple annotators.
CVJul 24, 2018
Combining Heterogeneously Labeled Datasets For Training Segmentation NetworksJana Kemnitz, Christian F. Baumgartner, Wolfgang Wirth et al.
Accurate segmentation of medical images is an important step towards analyzing and tracking disease-related morphological alterations in the anatomy. Convolutional neural networks (CNNs) have recently emerged as a powerful tool for many segmentation tasks in medical imaging. The performance of CNNs strongly depends on the size of the training data, and combining data from different sources is an effective strategy for obtaining larger training datasets. However, this is often challenged by heterogeneous labeling of the datasets. For instance, one of the datasets may be missing labels or a number of labels may have been combined into a super label. In this work we propose a cost function which allows integration of multiple datasets with heterogeneous label subsets into a joint training. We evaluated the performance of this strategy on a thigh MR and a cardiac MR dataset in which we artificially merged labels for half of the data. We found the proposed cost function substantially outperforms a naive masking approach, obtaining results very close to using the full annotations.
CVJul 12, 2018
Learning to Segment Medical Images with Scribble-Supervision AloneYigit B. Can, Krishna Chaitanya, Basil Mustafa et al.
Semantic segmentation of medical images is a crucial step for the quantification of healthy anatomy and diseases alike. The majority of the current state-of-the-art segmentation algorithms are based on deep neural networks and rely on large datasets with full pixel-wise annotations. Producing such annotations can often only be done by medical professionals and requires large amounts of valuable time. Training a medical image segmentation network with weak annotations remains a relatively unexplored topic. In this work we investigate training strategies to learn the parameters of a pixel-wise segmentation network from scribble annotations alone. We evaluate the techniques on public cardiac (ACDC) and prostate (NCI-ISBI) segmentation datasets. We find that the networks trained on scribbles suffer from a remarkably small degradation in Dice of only 2.9% (cardiac) and 4.5% (prostate) with respect to a network trained on full annotations.
CVApr 24, 2018
Human-level Performance On Automatic Head Biometrics In Fetal Ultrasound Using Fully Convolutional Neural NetworksMatthew Sinclair, Christian F. Baumgartner, Jacqueline Matthew et al.
Measurement of head biometrics from fetal ultrasonography images is of key importance in monitoring the healthy development of fetuses. However, the accurate measurement of relevant anatomical structures is subject to large inter-observer variability in the clinic. To address this issue, an automated method utilizing Fully Convolutional Networks (FCN) is proposed to determine measurements of fetal head circumference (HC) and biparietal diameter (BPD). An FCN was trained on approximately 2000 2D ultrasound images of the head with annotations provided by 45 different sonographers during routine screening examinations to perform semantic segmentation of the head. An ellipse is fitted to the resulting segmentation contours to mimic the annotation typically produced by a sonographer. The model's performance was compared with inter-observer variability, where two experts manually annotated 100 test images. Mean absolute model-expert error was slightly better than inter-observer error for HC (1.99mm vs 2.16mm), and comparable for BPD (0.61mm vs 0.59mm), as well as Dice coefficient (0.980 vs 0.980). Our results demonstrate that the model performs at a level similar to a human expert, and learns to produce accurate predictions from a large dataset annotated by many sonographers. Additionally, measurements are generated in near real-time at 15fps on a GPU, which could speed up clinical workflow for both skilled and trainee sonographers.
CVNov 30, 2017
MR image reconstruction using deep density priorsKerem C. Tezcan, Christian F. Baumgartner, Roger Luechinger et al.
Algorithms for Magnetic Resonance (MR) image reconstruction from undersampled measurements exploit prior information to compensate for missing k-space data. Deep learning (DL) provides a powerful framework for extracting such information from existing image datasets, through learning, and then using it for reconstruction. Leveraging this, recent methods employed DL to learn mappings from undersampled to fully sampled images using paired datasets, including undersampled and corresponding fully sampled images, integrating prior knowledge implicitly. In this article, we propose an alternative approach that learns the probability distribution of fully sampled MR images using unsupervised DL, specifically Variational Autoencoders (VAE), and use this as an explicit prior term in reconstruction, completely decoupling the encoding operation from the prior. The resulting reconstruction algorithm enjoys a powerful image prior to compensate for missing k-space data without requiring paired datasets for training and without being prone to associated sensitivities, such as deviations in undersampling patterns or coil settings between training and test time. We evaluated the proposed method with T1 weighted images from a publicly available dataset, multi-coil complex images acquired from healthy volunteers (N=8) and images with white matter lesions. The proposed algorithm, using the VAE prior, produced visually high quality reconstructions and achieved low RMSE values, outperforming most of the alternative methods on the same dataset. On multi-coil complex data, the algorithm yielded accurate magnitude and phase reconstruction results. In the experiments on images with white matter lesions, the method faithfully reconstructed the lesions. Keywords: Reconstruction, MRI, prior probability, machine learning, deep learning, unsupervised learning, density estimation
CVNov 24, 2017
Visual Feature Attribution using Wasserstein GANsChristian F. Baumgartner, Lisa M. Koch, Kerem Can Tezcan et al.
Attributing the pixels of an input image to a certain category is an important and well-studied problem in computer vision, with applications ranging from weakly supervised localisation to understanding hidden effects in the data. In recent years, approaches based on interpreting a previously trained neural network classifier have become the de facto state-of-the-art and are commonly used on medical as well as natural image datasets. In this paper, we discuss a limitation of these approaches which may lead to only a subset of the category specific features being detected. To address this problem we develop a novel feature attribution technique based on Wasserstein Generative Adversarial Networks (WGAN), which does not suffer from this limitation. We show that our proposed method performs substantially better than the state-of-the-art for visual attribution on a synthetic dataset and on real 3D neuroimaging data from patients with mild cognitive impairment (MCI) and Alzheimer's disease (AD). For AD patients the method produces compellingly realistic disease effect maps which are very close to the observed effects.
CVSep 13, 2017
An Exploration of 2D and 3D Deep Learning Techniques for Cardiac MR Image SegmentationChristian F. Baumgartner, Lisa M. Koch, Marc Pollefeys et al.
Accurate segmentation of the heart is an important step towards evaluating cardiac function. In this paper, we present a fully automated framework for segmentation of the left (LV) and right (RV) ventricular cavities and the myocardium (Myo) on short-axis cardiac MR images. We investigate the suitability of various state-of-the-art 2D and 3D convolutional neural network architectures, as well as slight modifications thereof, for this task. Experiments were performed on the ACDC 2017 challenge training dataset comprising cardiac MR images of 100 patients, where manual reference segmentations were made available for end-diastolic (ED) and end-systolic (ES) frames. We find that processing the images in a slice-by-slice fashion using 2D networks is beneficial due to a relatively large slice thickness. However, the exact network architecture only plays a minor role. We report mean Dice coefficients of $0.950$ (LV), $0.893$ (RV), and $0.899$ (Myo), respectively, with an average evaluation time of 1.1 seconds per volume on a modern GPU.
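For readers unfamiliar with the Dice coefficient reported above, here is a minimal sketch of how it is commonly computed for binary masks, using the standard definition 2|A∩B| / (|A| + |B|). The function name, the empty-mask convention, and the toy masks are assumptions; the challenge's own evaluation code may differ in details.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Toy masks: 3 foreground pixels each, 2 of them overlapping.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
score = dice(pred, target)  # 2 * 2 / (3 + 3) = 2/3
```

For a multi-structure segmentation such as LV, RV, and Myo, the same function would be applied to one binary mask per label.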
CVDec 16, 2016
SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand UltrasoundChristian F. Baumgartner, Konstantinos Kamnitsas, Jacqueline Matthew et al.
Identifying and interpreting fetal standard scan planes during 2D ultrasound mid-pregnancy examinations are highly complex tasks which require years of training. Apart from guiding the probe to the correct location, it can be equally difficult for a non-expert to identify relevant structures within the image. Automatic image processing can provide tools to help experienced as well as inexperienced operators with these tasks. In this paper, we propose a novel method based on convolutional neural networks which can automatically detect 13 fetal standard views in freehand 2D ultrasound data as well as provide a localisation of the fetal structures via a bounding box. An important contribution is that the network learns to localise the target anatomy using weak supervision based on image-level labels only. The network architecture is designed to operate in real-time while providing optimal output for the localisation task. We present results for real-time annotation, retrospective frame retrieval from saved videos, and localisation on a very large and challenging dataset consisting of images and video recordings of full clinical anomaly screenings. We found that the proposed method achieved an average F1-score of 0.798 in a realistic classification experiment modelling real-time detection, and obtained a 90.09% accuracy for retrospective frame retrieval. Moreover, an accuracy of 77.8% was achieved on the localisation task.
CVApr 29, 2016
Multi-Atlas Segmentation using Partially Annotated Data: Methods and Annotation StrategiesLisa M. Koch, Martin Rajchl, Wenjia Bai et al.
Multi-atlas segmentation is a widely used tool in medical image analysis, providing robust and accurate results by learning from annotated atlas datasets. However, the availability of fully annotated atlas images for training is limited due to the time required for the labelling task. Segmentation methods requiring only a proportion of each atlas image to be labelled could therefore reduce the workload on expert raters tasked with annotating atlas images. To address this issue, we first re-examine the labelling problem common in many existing approaches and formulate its solution in terms of a Markov Random Field energy minimisation problem on a graph connecting atlases and the target image. This provides a unifying framework for multi-atlas segmentation. We then show how modifications in the graph configuration of the proposed framework enable the use of partially annotated atlas images and investigate different partial annotation strategies. The proposed method was evaluated on two Magnetic Resonance Imaging (MRI) datasets for hippocampal and cardiac segmentation. Experiments were performed aimed at (1) recreating existing segmentation techniques with the proposed framework and (2) demonstrating the potential of employing sparsely annotated atlas data for multi-atlas segmentation.