Ben Glocker

CV
h-index: 45
151 papers
24,511 citations
Novelty: 44%
AI Score: 59

151 Papers

CV · Jun 3, 2022
Metrics reloaded: Recommendations for image analysis validation

Lena Maier-Hein, Annika Reinke, Patrick Godau et al. · utoronto

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.

CY · Aug 11, 2023
FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare

Karim Lekadir, Aasa Feragen, Abdul Joseph Fofanah et al. · eth-zurich

Despite major advances in artificial intelligence (AI) for medicine and healthcare, the deployment and adoption of AI technologies remain limited in real-world clinical practice. In recent years, concerns have been raised about the technical, clinical, ethical and legal risks associated with medical AI. To increase real world adoption, it is essential that medical AI tools are trusted and accepted by patients, clinicians, health organisations and authorities. This work describes the FUTURE-AI guideline as the first international consensus framework for guiding the development and deployment of trustworthy AI tools in healthcare. The FUTURE-AI consortium was founded in 2021 and currently comprises 118 inter-disciplinary experts from 51 countries representing all continents, including AI scientists, clinicians, ethicists, and social scientists. Over a two-year period, the consortium defined guiding principles and best practices for trustworthy AI through an iterative process comprising an in-depth literature review, a modified Delphi survey, and online consensus meetings. The FUTURE-AI framework was established based on 6 guiding principles for trustworthy AI in healthcare, i.e. Fairness, Universality, Traceability, Usability, Robustness and Explainability. Through consensus, a set of 28 best practices were defined, addressing technical, clinical, legal and socio-ethical dimensions. The recommendations cover the entire lifecycle of medical AI, from design, development and validation to regulation, deployment, and monitoring. FUTURE-AI is a risk-informed, assumption-free guideline which provides a structured approach for constructing medical AI tools that will be trusted, deployed and adopted in real-world practice. Researchers are encouraged to take the recommendations into account in proof-of-concept stages to facilitate future translation towards clinical practice of medical AI.

CV · Mar 2, 2023
Measuring axiomatic soundness of counterfactual image models

Miguel Monteiro, Fabio De Sousa Ribeiro, Nick Pawlowski et al. · microsoft-research

We present a general framework for evaluating image counterfactuals. The power and flexibility of deep generative models make them valuable tools for learning mechanisms in structural causal models. However, their flexibility makes counterfactual identifiability impossible in the general case. Motivated by these issues, we revisit Pearl's axiomatic definition of counterfactuals to determine the necessary constraints of any counterfactual inference model: composition, reversibility, and effectiveness. We frame counterfactuals as functions of an input variable, its parents, and counterfactual parents and use the axiomatic constraints to restrict the set of functions that could represent the counterfactual, thus deriving distance metrics between the approximate and ideal functions. We demonstrate how these metrics can be used to compare and choose between different approximate counterfactual inference models and to provide insight into a model's shortcomings and trade-offs.
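The three axiomatic constraints can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a hypothetical noise-free scalar mechanism `x = 2 * pa`, so that the ideal counterfactual function and a parent predictor (`predict_parent`) are known exactly, and it uses absolute difference as the distance. All function names are illustrative.

```python
def ideal_cf(x, pa, pa_star):
    """Exact counterfactual for the assumed toy mechanism x = 2 * pa."""
    return x + 2.0 * (pa_star - pa)

def approx_cf(x, pa, pa_star):
    """A deliberately imperfect model of the same counterfactual."""
    return x + 1.8 * (pa_star - pa) + 0.1

def predict_parent(x):
    """Oracle that recovers the parent from x under the toy mechanism."""
    return x / 2.0

def composition_error(cf, x, pa):
    # A null intervention (pa_star == pa) should leave x unchanged.
    return abs(cf(x, pa, pa) - x)

def reversibility_error(cf, x, pa, pa_star):
    # Intervening and then reverting should return the original x.
    forward = cf(x, pa, pa_star)
    return abs(cf(forward, pa_star, pa) - x)

def effectiveness_error(cf, x, pa, pa_star):
    # The counterfactual should actually carry the intervened parent value.
    return abs(predict_parent(cf(x, pa, pa_star)) - pa_star)

x, pa, pa_star = 4.0, 2.0, 3.0
ideal_scores = (
    composition_error(ideal_cf, x, pa),
    reversibility_error(ideal_cf, x, pa, pa_star),
    effectiveness_error(ideal_cf, x, pa, pa_star),
)
approx_scores = (
    composition_error(approx_cf, x, pa),
    reversibility_error(approx_cf, x, pa, pa_star),
    effectiveness_error(approx_cf, x, pa, pa_star),
)
```

In this toy setting the ideal function scores zero on all three measures, while the imperfect one incurs a non-zero error on each, which is the kind of comparison between approximate models that the abstract describes.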

CV · Aug 23, 2022
Deep Structural Causal Shape Models

Rajat Rasal, Daniel C. Castro, Nick Pawlowski et al. · microsoft-research

Causal reasoning provides a language to ask important interventional and counterfactual questions beyond purely statistical association. In medical imaging, for example, we may want to study the causal effect of genetic, environmental, or lifestyle factors on the normal and pathological variation of anatomical phenotypes. However, while anatomical shape models of 3D surface meshes, extracted from automated image segmentation, can be reliably constructed, there is a lack of computational tooling to enable causal reasoning about morphological variations. To tackle this problem, we propose deep structural causal shape models (CSMs), which utilise high-quality mesh generation techniques, from geometric deep learning, within the expressive framework of deep structural causal models. CSMs enable subject-specific prognoses through counterfactual mesh generation ("How would this patient's brain structure change if they were ten years older?"), which is in contrast to most current works on purely population-level statistical shape modelling. We demonstrate the capabilities of CSMs at all levels of Pearl's causal hierarchy through a number of qualitative and quantitative experiments leveraging a large dataset of 3D brain structures.

CV · Dec 16, 2022 · Code
Context Label Learning: Improving Background Class Representations in Semantic Segmentation

Zeju Li, Konstantinos Kamnitsas, Cheng Ouyang et al.

Background samples provide key contextual information for segmenting regions of interest (ROIs). However, they always cover a diverse set of structures, causing difficulties for the segmentation model to learn good decision boundaries with high sensitivity and precision. The issue concerns the highly heterogeneous nature of the background class, resulting in multi-modal distributions. Empirically, we find that neural networks trained with heterogeneous background struggle to map the corresponding contextual samples to compact clusters in feature space. As a result, the distribution over background logit activations may shift across the decision boundary, leading to systematic over-segmentation across different datasets and tasks. In this study, we propose context label learning (CoLab) to improve the context representations by decomposing the background class into several subclasses. Specifically, we train an auxiliary network as a task generator, along with the primary segmentation model, to automatically generate context labels that positively affect the ROI segmentation accuracy. Extensive experiments are conducted on several challenging segmentation tasks and datasets. The results demonstrate that CoLab can guide the segmentation model to map the logits of background samples away from the decision boundary, resulting in significantly improved segmentation accuracy. Code is available.

LG · Apr 22, 2022
Federated Learning Enables Big Data for Rare Cancer Boundary Detection

Sarthak Pati, Ujjwal Baid, Brandon Edwards et al.

Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even infeasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing.

IV · Jul 5, 2022 · Code
Vector Quantisation for Robust Segmentation

Ainkaran Santhirasekaram, Avinash Kori, Mathias Winkler et al.

The reliability of segmentation models in the medical domain depends on the model's robustness to perturbations in the input space. Robustness is a particular challenge in medical imaging exhibiting various sources of image noise, corruptions, and domain shifts. Obtaining robustness is often attempted via simulating heterogeneous environments, either heuristically in the form of data augmentation or by learning to generate specific perturbations in an adversarial manner. We propose and justify that learning a discrete representation in a low dimensional embedding space improves robustness of a segmentation model. This is achieved with a dictionary learning method called vector quantisation. We use a set of experiments designed to analyse robustness in both the latent and output space under domain shift and noise perturbations in the input space. We adapt the popular UNet architecture, inserting a quantisation block in the bottleneck. We demonstrate improved segmentation accuracy and better robustness on three segmentation tasks. Code is available at https://github.com/AinkaranSanthi/Vector-Quantisation-for-Robust-Segmentation
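The core of vector quantisation is a nearest-neighbour lookup into a learned codebook. The sketch below is a minimal illustration, not the code from the linked repository: the codebook values and the `quantise` helper are hypothetical, and the learning of the codebook itself is left out. It shows why a discrete bottleneck can absorb small perturbations: two nearby feature vectors snap to the same code.

```python
import math

def quantise(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` (Euclidean)."""
    best_index = min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))
    return best_index, codebook[best_index]

# Hypothetical 2-D codebook with three entries.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

clean = (0.9, 1.1)
noisy = (1.05, 0.85)  # the same feature after a small perturbation

idx_clean, code_clean = quantise(clean, codebook)
idx_noisy, code_noisy = quantise(noisy, codebook)
```

Both inputs map to the same codebook entry, so everything downstream of the quantisation step sees identical features despite the perturbation.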

CV · Jul 5, 2022 · Code
GLANCE: Global to Local Architecture-Neutral Concept-based Explanations

Avinash Kori, Ben Glocker, Francesca Toni

Most of the current explainability techniques focus on capturing the importance of features in input space. However, given the complexity of models and data-generating processes, the resulting explanations are far from being 'complete', in that they lack an indication of feature interactions and visualization of their 'effect'. In this work, we propose a novel twin-surrogate explainability framework to explain the decisions made by any CNN-based image classifier (irrespective of the architecture). For this, we first disentangle latent features from the classifier, followed by aligning these features to observed/human-defined 'context' features. These aligned features form semantically meaningful concepts that are used for extracting a causal graph depicting the 'perceived' data-generating process, describing the inter- and intra-feature interactions between unobserved latent features and observed 'context' features. This causal graph serves as a global model from which local explanations of different forms can be extracted. Specifically, we provide a generator to visualize the 'effect' of interactions among features in latent space and draw feature importance therefrom as local explanations. Our framework utilizes adversarial knowledge distillation to faithfully learn a representation from the classifiers' latent space and use it for extracting visual explanations. We use the styleGAN-v2 architecture with an additional regularization term to enforce disentanglement and alignment. We demonstrate and evaluate explanations obtained with our framework on Morpho-MNIST and on the FFHQ human faces dataset. Our framework is available at https://github.com/koriavinash1/GLANCE-Explanations

LG · Jul 31, 2023
No Fair Lunch: A Causal Perspective on Dataset Bias in Machine Learning for Medical Imaging

Charles Jones, Daniel C. Castro, Fabio De Sousa Ribeiro et al. · microsoft-research

As machine learning methods gain prominence within clinical decision-making, addressing fairness concerns becomes increasingly urgent. Despite considerable work dedicated to detecting and ameliorating algorithmic bias, today's methods are deficient with potentially harmful consequences. Our causal perspective sheds new light on algorithmic bias, highlighting how different sources of dataset bias may appear indistinguishable yet require substantially different mitigation strategies. We introduce three families of causal bias mechanisms stemming from disparities in prevalence, presentation, and annotation. Our causal analysis underscores how current mitigation methods tackle only a narrow and often unrealistic subset of scenarios. We provide a practical three-step framework for reasoning about fairness in medical imaging, supporting the development of safe and equitable AI prediction models.

CV · Jul 5, 2022 · Code
Hierarchical Symbolic Reasoning in Hyperbolic Space for Deep Discriminative Models

Ainkaran Santhirasekaram, Avinash Kori, Andrea Rockall et al.

Explanations for black-box models help us understand model decisions as well as provide information on model biases and inconsistencies. Most of the current explainability techniques provide a single level of explanation, often in terms of feature importance scores or feature attention maps in input space. Our focus is on explaining deep discriminative models at multiple levels of abstraction, from fine-grained to fully abstract explanations. We achieve this by using the natural properties of hyperbolic geometry to more efficiently model a hierarchy of symbolic features and generate hierarchical symbolic rules as part of our explanations. Specifically, for any given deep discriminative model, we distill the underpinning knowledge by discretisation of the continuous latent space using vector quantisation to form symbols, followed by a hyperbolic reasoning block to induce an abstraction tree. We traverse the tree to extract explanations in terms of symbolic rules and their corresponding visual semantics. We demonstrate the effectiveness of our method on the MNIST and AFHQ high-resolution animal faces datasets. Our framework is available at https://github.com/koriavinash1/SymbolicInterpretability

CV · Nov 3, 2022
Analysing the effectiveness of a generative model for semi-supervised medical image segmentation

Margherita Rosnati, Fabio De Sousa Ribeiro, Miguel Monteiro et al. · microsoft-research

Image segmentation is important in medical imaging, providing valuable, quantitative information for clinical decision-making in diagnosis, therapy, and intervention. The state-of-the-art in automated segmentation remains supervised learning, employing discriminative models such as U-Net. However, training these models requires access to large amounts of manually labelled data which is often difficult to obtain in real medical applications. In such settings, semi-supervised learning (SSL) attempts to leverage the abundance of unlabelled data to obtain more robust and reliable models. Recently, generative models have been proposed for semantic segmentation, as they make an attractive choice for SSL. Their ability to capture the joint distribution over input images and output label maps provides a natural way to incorporate information from unlabelled images. This paper analyses whether deep generative models such as the SemanticGAN are truly viable alternatives to tackle challenging medical image segmentation problems. To that end, we thoroughly evaluate the segmentation performance, robustness, and potential subgroup disparities of discriminative and generative segmentation methods when applied to large-scale, publicly available chest X-ray datasets.

CV · Feb 3, 2023
Understanding metric-related pitfalls in image analysis validation

Annika Reinke, Minu D. Tizabi, Michael Baumgartner et al.

Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.

CV · Aug 14, 2023 · Code
Distance Matters For Improving Performance Estimation Under Covariate Shift

Mélanie Roschewitz, Ben Glocker

Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use-cases. Recently, several solutions were proposed to tackle this problem, most leveraging model predictions or softmax confidence to derive accuracy estimates. However, under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Specifically, we introduce a "distance-check" to flag samples that lie too far from the expected distribution, to avoid relying on their untrustworthy model outputs in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide range of natural and synthetic distribution shifts and hundreds of models, with a median relative MAE improvement of 27% over the best baseline across all tasks, and SOTA performance on 10 out of 13 tasks. Our code is publicly available at https://github.com/melanibe/distance_matters_performance_estimation.
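One way to picture the distance check is as an extra gate on a confidence-based accuracy estimate. The sketch below is not taken from the linked repository and rests on several assumptions: distance is measured to the nearest training embeddings, both thresholds are given rather than fitted, and flagged samples are simply counted as errors. The names `knn_distance` and `estimate_accuracy` are hypothetical.

```python
import math

def knn_distance(point, train_points, k=1):
    """Mean Euclidean distance from `point` to its k nearest training points."""
    dists = sorted(math.dist(point, t) for t in train_points)
    return sum(dists[:k]) / k

def estimate_accuracy(confidences, embeddings, train_embeddings,
                      conf_threshold, dist_threshold, k=1):
    """Share of test samples that are both confident and close to the training data."""
    trusted = 0
    for conf, emb in zip(confidences, embeddings):
        too_far = knn_distance(emb, train_embeddings, k) > dist_threshold
        # Assumption: a flagged sample is treated as a likely error,
        # however confident the model claims to be.
        if conf >= conf_threshold and not too_far:
            trusted += 1
    return trusted / len(confidences)

train_embeddings = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
test_embeddings = [(0.1, 0.1), (0.9, 0.1), (5.0, 5.0)]
test_confidences = [0.95, 0.60, 0.99]

with_check = estimate_accuracy(test_confidences, test_embeddings, train_embeddings,
                               conf_threshold=0.8, dist_threshold=1.0)
without_check = estimate_accuracy(test_confidences, test_embeddings, train_embeddings,
                                  conf_threshold=0.8, dist_threshold=float("inf"))
```

The third test sample is highly confident but far from every training point; without the check it inflates the estimate, and with the check it no longer counts as a trusted prediction.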

LG · Sep 7, 2022
Risk of Bias in Chest Radiography Deep Learning Foundation Models

Ben Glocker, Charles Jones, Melanie Roschewitz et al.

Purpose: To analyze a recently published chest radiography foundation model for the presence of biases that could lead to subgroup performance disparities across biological sex and race. Materials and Methods: This retrospective study used 127,118 chest radiographs from 42,884 patients (mean age, 63 ± 17 [SD] years; 23,623 male, 19,261 female) from the CheXpert dataset collected between October 2002 and July 2017. To determine the presence of bias in features generated by a chest radiography foundation model and baseline deep learning model, dimensionality reduction methods together with two-sample Kolmogorov-Smirnov tests were used to detect distribution shifts across sex and race. A comprehensive disease detection performance analysis was then performed to associate any biases in the features to specific disparities in classification performance across patient subgroups. Results: Ten out of twelve pairwise comparisons across biological sex and race showed statistically significant differences in the studied foundation model, compared with four significant tests in the baseline model. Significant differences were found between male and female (P < .001) and Asian and Black patients (P < .001) in the feature projections that primarily capture disease. Compared with average model performance across all subgroups, classification performance on the 'no finding' label dropped between 6.8% and 7.8% for female patients, and performance in detecting 'pleural effusion' dropped between 10.7% and 11.6% for Black patients. Conclusion: The studied chest radiography foundation model demonstrated racial and sex-related bias leading to disparate performance across patient subgroups and may be unsafe for clinical applications.
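The two-sample Kolmogorov-Smirnov test compares the empirical distribution functions of two samples, here one projected feature dimension for two patient subgroups. The sketch below computes only the test statistic (the largest gap between the two empirical CDFs) in plain Python; the p-value computation is omitted, and the example values are made up for illustration rather than drawn from the study.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so that ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

# Hypothetical values of one projected feature for two subgroups.
group_a = [1.0, 2.0, 3.0, 4.0]
group_b = [3.0, 4.0, 5.0, 6.0]

shifted = ks_statistic(group_a, group_b)
identical = ks_statistic(group_a, group_a)
disjoint = ks_statistic([1.0, 2.0], [3.0, 4.0])
```

A statistic near 0 means the two subgroups are distributed alike along that feature, while a value near 1 means they barely overlap.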

LG · Jun 27, 2023
High Fidelity Image Counterfactuals with Probabilistic Causal Models

Fabio De Sousa Ribeiro, Tian Xia, Miguel Monteiro et al.

We present a general causal generative modelling framework for accurate estimation of high fidelity image counterfactuals with deep structural causal models. Estimation of interventional and counterfactual queries for high-dimensional structured variables, such as images, remains a challenging task. We leverage ideas from causal mediation analysis and advances in generative modelling to design new deep causal mechanisms for structured variables in causal models. Our experiments demonstrate that our proposed mechanisms are capable of accurate abduction and estimation of direct, indirect and total effects as measured by axiomatic soundness of counterfactuals.

LG · Jul 18, 2023
Grounded Object Centric Learning

Avinash Kori, Francesco Locatello, Fabio De Sousa Ribeiro et al.

The extraction of modular object-centric representations for downstream tasks is an emerging area of research. Learning grounded representations of objects that are guaranteed to be stable and invariant promises robust performance across different tasks and environments. Slot Attention (SA) learns object-centric representations by assigning objects to slots, but presupposes a single distribution from which all slots are randomly initialised. This results in an inability to learn specialized slots which bind to specific object types and remain invariant to identity-preserving changes in object appearance. To address this, we present Conditional Slot Attention (CoSA) using a novel concept of Grounded Slot Dictionary (GSD) inspired by vector quantization. Our proposed GSD comprises (i) canonical object-level property vectors and (ii) parametric Gaussian distributions, which define a prior over the slots. We demonstrate the benefits of our method in multiple downstream tasks such as scene generation, composition, and task adaptation, whilst remaining competitive with SA in popular object discovery benchmarks.

IV · Sep 19, 2023
Analysing race and sex bias in brain age prediction

Carolina Piçarra, Ben Glocker

Brain age prediction from MRI has become a popular imaging biomarker associated with a wide range of neuropathologies. The datasets used for training, however, are often skewed and imbalanced regarding demographics, potentially making brain age prediction models susceptible to bias. We analyse the commonly used ResNet-34 model by conducting a comprehensive subgroup performance analysis and feature inspection. The model is trained on 1,215 T1-weighted MRI scans from Cam-CAN and IXI, and tested on UK Biobank (n=42,786), split into six racial and biological sex subgroups. With the objective of comparing the performance between subgroups, measured by the absolute prediction error, we use a Kruskal-Wallis test followed by two post-hoc Conover-Iman tests to inspect bias across race and biological sex. To examine biases in the generated features, we use PCA for dimensionality reduction and employ two-sample Kolmogorov-Smirnov tests to identify distribution shifts among subgroups. Our results reveal statistically significant differences in predictive performance between Black and White, Black and Asian, and male and female subjects. Seven out of twelve pairwise comparisons show statistically significant differences in the feature distributions. Our findings call for further analysis of brain age prediction models.

CV · Jul 20, 2022
Estimating Model Performance under Domain Shifts with Class-Specific Confidence Scores

Zeju Li, Konstantinos Kamnitsas, Mobarakol Islam et al.

Machine learning models are typically deployed in a test setting that differs from the training setting, potentially leading to decreased model performance because of domain shift. If we could estimate the performance that a pre-trained model would achieve on data from a specific deployment setting, for example a certain clinic, we could judge whether the model could safely be deployed or if its performance degrades unacceptably on the specific data. Existing approaches estimate this based on the confidence of predictions made on unlabeled test data from the deployment's domain. We find existing methods struggle with data that present class imbalance, because the methods used to calibrate confidence do not account for bias induced by class imbalance, consequently failing to estimate class-wise accuracy. Here, we introduce class-wise calibration within the framework of performance estimation for imbalanced datasets. Specifically, we derive class-specific modifications of state-of-the-art confidence-based model evaluation methods including temperature scaling (TS), difference of confidences (DoC), and average thresholded confidence (ATC). We also extend the methods to estimate Dice similarity coefficient (DSC) in image segmentation. We conduct experiments on four tasks and find the proposed modifications consistently improve the estimation accuracy for imbalanced datasets. Our methods improve accuracy estimation by 18% in classification under natural domain shifts, and double the estimation accuracy on segmentation tasks, when compared with prior methods.
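As an illustration of what a class-specific variant of average thresholded confidence (ATC) might look like, the sketch below fits one confidence threshold per predicted class on labelled validation data, then estimates accuracy on unlabelled target data as the share of predictions above their class threshold. This is a simplified reading of the idea under stated assumptions, not the authors' implementation: the helper names are hypothetical, raw confidence is used as the score, and ties are ignored.

```python
def fit_threshold(confidences, correct):
    """Pick t so that the share of samples with confidence >= t matches accuracy."""
    n_correct = sum(correct)
    if n_correct == 0:
        return float("inf")  # no prediction of this class should be trusted
    ranked = sorted(confidences, reverse=True)
    return ranked[n_correct - 1]

def fit_classwise_thresholds(preds, confidences, correct):
    """Fit a separate threshold for each predicted class on validation data."""
    thresholds = {}
    for cls in set(preds):
        idx = [i for i, p in enumerate(preds) if p == cls]
        thresholds[cls] = fit_threshold([confidences[i] for i in idx],
                                        [correct[i] for i in idx])
    return thresholds

def estimate_accuracy(preds, confidences, thresholds):
    """Share of unlabelled predictions whose confidence clears the class threshold."""
    hits = sum(conf >= thresholds.get(p, float("inf"))
               for p, conf in zip(preds, confidences))
    return hits / len(preds)

# Labelled validation data: predicted class, confidence, correctness (1 or 0).
val_preds = [0, 0, 0, 0, 1, 1]
val_conf = [0.9, 0.8, 0.7, 0.6, 0.9, 0.55]
val_correct = [1, 1, 1, 0, 1, 0]
thresholds = fit_classwise_thresholds(val_preds, val_conf, val_correct)

# Unlabelled target data from the deployment domain.
target_preds = [0, 0, 1, 1]
target_conf = [0.75, 0.65, 0.95, 0.6]
estimated = estimate_accuracy(target_preds, target_conf, thresholds)
```

Fitting per class lets a rarely predicted, less reliable class receive a stricter threshold than the majority class, which a single global threshold cannot do.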

CV · Jul 6, 2023
The Role of Subgroup Separability in Group-Fair Medical Image Classification

Charles Jones, Mélanie Roschewitz, Ben Glocker

We investigate performance disparities in deep classifiers. We find that the ability of classifiers to separate individuals into subgroups varies substantially across medical imaging modalities and protected characteristics; crucially, we show that this property is predictive of algorithmic bias. Through theoretical analysis and extensive empirical evaluation, we find a relationship between subgroup separability, subgroup disparities, and performance degradation when models are trained on data with systematic bias such as underdiagnosis. Our findings shed new light on the question of how models become biased, providing important insights for the development of fair medical imaging AI.

AI · May 27, 2022
Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed

Melanie Bernhardt, Fabio De Sousa Ribeiro, Ben Glocker

Failure detection in automated image classification is a critical safeguard for clinical deployment. Detected failure cases can be referred to human assessment, ensuring patient safety in computer-aided clinical decision making. Despite its paramount importance, there is insufficient evidence about the ability of state-of-the-art confidence scoring methods to detect test-time failures of classification models in the context of medical imaging. This paper provides a reality check, establishing the performance of in-domain misclassification detection methods, benchmarking 9 widely used confidence scores on 6 medical imaging datasets with different imaging modalities, in multiclass and binary classification settings. Our experiments show that the problem of failure detection is far from being solved. We found that none of the benchmarked advanced methods proposed in the computer vision and machine learning literature can consistently outperform a simple softmax baseline, demonstrating that improved out-of-distribution detection or model calibration do not necessarily translate to improved in-domain misclassification detection. Our developed testbed facilitates future work in this important area.

CV · Feb 2, 2023
Paced-Curriculum Distillation with Prediction and Label Uncertainty for Image Segmentation

Mobarakol Islam, Lalithkumar Seenivasan, S. P. Sharan et al.

Purpose: In curriculum learning, the idea is to train on easier samples first and gradually increase the difficulty, while in self-paced learning, a pacing function defines the speed to adapt the training progress. While both methods heavily rely on the ability to score the difficulty of data samples, an optimal scoring function is still under exploration. Methodology: Distillation is a knowledge transfer approach where a teacher network guides a student network by feeding a sequence of random samples. We argue that guiding student networks with an efficient curriculum strategy can improve model generalization and robustness. For this purpose, we design an uncertainty-based paced curriculum learning in self-distillation for medical image segmentation. We fuse the prediction uncertainty and annotation boundary uncertainty to develop a novel paced-curriculum distillation (PCD). We utilize the teacher model to obtain prediction uncertainty and spatially varying label smoothing with Gaussian kernel to generate segmentation boundary uncertainty from the annotation. We also investigate the robustness of our method by applying various types and severity of image perturbation and corruption. Results: The proposed technique is validated on two medical datasets of breast ultrasound image segmentation and robot-assisted surgical scene segmentation and achieved significantly better performance in terms of segmentation and robustness. Conclusion: PCD improves the performance and obtains better generalization and robustness over the dataset shift. While curriculum learning requires extensive tuning of hyper-parameters for pacing function, the level of performance improvement outweighs this limitation.

IV · Aug 14, 2023
Robustness Stress Testing in Medical Image Classification

Mobarakol Islam, Zeju Li, Ben Glocker

Deep neural networks have shown impressive performance for image-based disease detection. Performance is commonly evaluated through clinical validation on independent test sets to demonstrate clinically acceptable accuracy. Reporting good performance metrics on test sets, however, is not always a sufficient indication of the generalizability and robustness of an algorithm. In particular, when the test data is drawn from the same distribution as the training data, the iid test set performance can be an unreliable estimate of the accuracy on new data. In this paper, we employ stress testing to assess model robustness and subgroup performance disparities in disease detection models. We design progressive stress testing using five different bidirectional and unidirectional image perturbations with six different severity levels. As a use case, we apply stress tests to measure the robustness of disease detection models for chest X-ray and skin lesion images, and demonstrate the importance of studying class and domain-specific model behaviour. Our experiments indicate that some models may yield more robust and equitable performance than others. We also find that pretraining characteristics play an important role in downstream robustness. We conclude that progressive stress testing is a viable and important tool and should become standard practice in the clinical validation of image-based disease detection models.

CVSep 20, 2022
Frequency Dropout: Feature-Level Regularization via Randomized Filtering

Mobarakol Islam, Ben Glocker

Deep convolutional neural networks have shown remarkable performance on various computer vision tasks, and yet, they are susceptible to picking up spurious correlations from the training signal. So-called 'shortcuts' can occur during learning, for example, when there are specific frequencies present in the image data that correlate with the output predictions. Both high and low frequencies can be characteristic of the underlying noise distribution caused by the image acquisition rather than in relation to the task-relevant information about the image content. Models that learn features related to this characteristic noise will not generalize well to new data. In this work, we propose a simple yet effective training strategy, Frequency Dropout, to prevent convolutional neural networks from learning frequency-specific imaging features. We employ randomized filtering of feature maps during training which acts as a feature-level regularization. In this study, we consider common image processing filters such as Gaussian smoothing, Laplacian of Gaussian, and Gabor filtering. Our training strategy is model-agnostic and can be used for any computer vision task. We demonstrate the effectiveness of Frequency Dropout on a range of popular architectures and multiple tasks including image classification, domain adaptation, and semantic segmentation using both computer vision and medical imaging datasets. Our results suggest that the proposed approach not only improves predictive accuracy but also improves robustness against domain shift.
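
As a rough illustration of randomized filtering of feature maps, the sketch below smooths each channel with a Gaussian filter with some probability during training and leaves it untouched at inference. Only Gaussian smoothing is shown; the per-channel probability `drop_prob`, the sigma range of 0.5 to 2.0, and the edge-replication border handling are assumptions for this example, not details from the paper.

```python
import math
import random
from typing import List

FeatureMap = List[List[float]]  # one channel, H x W


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalised 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(fmap: FeatureMap, kernel: List[float]) -> FeatureMap:
    """Filter each row with the kernel, replicating edge values at the borders."""
    radius = len(kernel) // 2
    width = len(fmap[0])
    return [
        [
            sum(k * row[min(width - 1, max(0, x + i - radius))] for i, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in fmap
    ]


def transpose(fmap: FeatureMap) -> FeatureMap:
    """Swap rows and columns."""
    return [list(col) for col in zip(*fmap)]


def gaussian_smooth(fmap: FeatureMap, sigma: float) -> FeatureMap:
    """Separable 2D Gaussian smoothing: filter rows, then columns."""
    kernel = gaussian_kernel(sigma, radius=max(1, int(3 * sigma)))
    return transpose(smooth_rows(transpose(smooth_rows(fmap, kernel)), kernel))


def frequency_dropout(
    channels: List[FeatureMap], drop_prob: float, rng: random.Random, training: bool = True
) -> List[FeatureMap]:
    """Filter each channel with probability drop_prob; act as the identity at inference."""
    if not training:
        return channels
    return [
        gaussian_smooth(c, sigma=rng.uniform(0.5, 2.0)) if rng.random() < drop_prob else c
        for c in channels
    ]
```

In a real network this operation would sit between convolutional layers and use a tensor library; plain lists are used here only to keep the example dependency-free.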

CVOct 29, 2022
A Comparative Study of Graph Neural Networks for Shape Classification in Neuroimaging

Nairouz Shehata, Wulfie Bain, Ben Glocker

Graph neural networks have emerged as a promising approach for the analysis of non-Euclidean data such as meshes. In medical imaging, mesh-like data plays an important role for modelling anatomical structures, and shape classification can be used in computer aided diagnosis and disease detection. However, with a plethora of options, the best architectural choices for medical shape analysis using GNNs remain unclear. We conduct a comparative analysis to provide practitioners with an overview of the current state-of-the-art in geometric deep learning for shape classification in neuroimaging. Using biological sex classification as a proof-of-concept task, we find that using FPFH as node features substantially improves GNN performance and generalisation to out-of-distribution data; we compare the performance of three alternative convolutional layers; and we reinforce the importance of data augmentation for graph based learning. We then confirm these results hold for a clinically relevant task, using the classification of Alzheimer's disease.

CVAug 8, 2022
Automatic lesion analysis for increased efficiency in outcome prediction of traumatic brain injury

Margherita Rosnati, Eyal Soreq, Miguel Monteiro et al.

The accurate prognosis for traumatic brain injury (TBI) patients is difficult yet essential to inform therapy, patient management, and long-term after-care. Patient characteristics such as age, motor and pupil responsiveness, hypoxia and hypotension, and radiological findings on computed tomography (CT), have been identified as important variables for TBI outcome prediction. CT is the acute imaging modality of choice in clinical practice because of its acquisition speed and widespread availability. However, this modality is mainly used for qualitative and semi-quantitative assessment, such as the Marshall scoring system, which is prone to subjectivity and human errors. This work explores the predictive power of imaging biomarkers extracted from routinely-acquired hospital admission CT scans using a state-of-the-art, deep learning TBI lesion segmentation method. We use lesion volumes and corresponding lesion statistics as inputs for an extended TBI outcome prediction model. We compare the predictive power of our proposed features to the Marshall score, independently and when paired with classic TBI biomarkers. We find that automatically extracted quantitative CT features perform similarly to or better than the Marshall score in predicting unfavourable TBI outcomes. Leveraging automatic atlas alignment, we also identify frontal extra-axial lesions as important indicators of poor outcome. Our work may contribute to a better understanding of TBI, and provides new insights into how automated neuroimaging analysis can be used to improve prognostication after TBI.

CVJul 30, 2024
SharkTrack: an accurate, generalisable software for streamlining shark and ray underwater video analysis

Filippo Varini, Joel H. Gayford, Jeremy Jenrette et al.

Elasmobranchs (sharks and rays) represent a critical component of marine ecosystems. Yet, they are experiencing global population declines and effective monitoring of populations is essential to their protection. Underwater stationary videos, such as those from Baited Remote Underwater Video Stations (BRUVS), are critical for understanding elasmobranch spatial ecology and abundance. However, processing these videos requires time-consuming manual analysis that can delay conservation. To address this challenge, we developed SharkTrack, a semi-automatic underwater video analysis software. SharkTrack uses Convolutional Neural Networks (CNN) and Multi-Object Tracking to automatically detect and track elasmobranchs and provides an annotation pipeline to manually classify elasmobranch species and compute species-specific MaxN (ssMaxN), the standard metric of relative abundance. When tested on BRUVS footage from locations unseen by the CNN model during training, SharkTrack computed ssMaxN with 89% accuracy over 207 hours of footage. The semi-automatic SharkTrack pipeline required two minutes of manual classification per hour of video, an estimated 95% reduction of manual analysis time compared to traditional methods. Furthermore, we demonstrate SharkTrack accuracy across diverse marine ecosystems and elasmobranch species, an advancement compared to previous models, which were limited to specific species or locations. SharkTrack applications extend beyond BRUVS, facilitating the analysis of any underwater stationary video. By making video analysis faster and more accessible, SharkTrack enables research and conservation organisations to monitor elasmobranch populations more efficiently, thereby improving conservation efforts. To further support these goals, we provide public access to the SharkTrack software.
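
To make the ssMaxN metric concrete: for each species, it is the largest number of individuals visible together in any single frame. The sketch below computes it from tracked detections. The record layout `(frame, track_id, species)` and the function name are hypothetical and do not reflect the SharkTrack output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# One record per tracked detection: (frame index, track id, species label).
Detection = Tuple[int, int, str]


def species_max_n(detections: Iterable[Detection]) -> Dict[str, int]:
    """Return, per species, the maximum number of individuals seen in any single frame."""
    # Collect the distinct track ids present for each (frame, species) pair.
    per_frame: Dict[Tuple[int, str], Set[int]] = defaultdict(set)
    for frame, track_id, species in detections:
        per_frame[(frame, species)].add(track_id)

    # Keep the largest per-frame count for each species.
    max_n: Dict[str, int] = defaultdict(int)
    for (_, species), track_ids in per_frame.items():
        max_n[species] = max(max_n[species], len(track_ids))
    return dict(max_n)


detections = [
    (0, 1, "species_a"), (0, 2, "species_a"),
    (1, 1, "species_a"), (1, 3, "species_b"),
    (2, 2, "species_a"), (2, 3, "species_b"),
]
result = species_max_n(detections)
```

Counting per frame rather than per video avoids counting the same animal twice when it leaves and re-enters the field of view.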

CVDec 22, 2022
Confidence-Aware Paced-Curriculum Learning by Label Smoothing for Surgical Scene Understanding

Mengya Xu, Mobarakol Islam, Ben Glocker et al.

Curriculum learning and self-paced learning are training strategies that gradually feed the samples from easy to more complex. They have attracted increasing attention due to their excellent performance in robotic vision. Most recent works focus on designing curricula based on difficulty levels in input samples or smoothing the feature maps. However, smoothing labels to control the learning utility in a curriculum manner is still unexplored. In this work, we design a paced curriculum by label smoothing (P-CBLS) using paced learning with uniform label smoothing (ULS) for classification tasks and fuse uniform and spatially varying label smoothing (SVLS) for semantic segmentation tasks in a curriculum manner. In ULS and SVLS, a bigger smoothing factor value enforces a heavier smoothing penalty on the true label and limits the information that can be learned. Therefore, we design the curriculum by label smoothing (CBLS). We set a bigger smoothing value at the beginning of training and gradually decrease it to zero to control the model learning utility from lower to higher. We also design a confidence-aware pacing function and combine it with our CBLS to investigate the benefits of various curricula. The proposed techniques are validated on four robotic surgery datasets of multi-class, multi-label classification, captioning, and segmentation tasks. We also investigate the robustness of our method by corrupting validation data into different severity levels. Our extensive analysis shows that the proposed method improves prediction accuracy and robustness.
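
The curriculum described here, a large smoothing factor early in training that decays to zero, can be sketched for the uniform label smoothing case as follows. The linear decay and the function names are assumptions for illustration; the abstract does not state the schedule used, and the confidence-aware pacing function is not modelled.

```python
from typing import List


def smooth_labels(one_hot: List[float], factor: float) -> List[float]:
    """Uniform label smoothing: mix the one-hot target with a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - factor) * y + factor / k for y in one_hot]


def smoothing_schedule(epoch: int, total_epochs: int, initial: float) -> float:
    """Linearly decay the smoothing factor from `initial` to zero (an assumed schedule)."""
    if total_epochs <= 1:
        return 0.0
    return initial * max(0.0, 1.0 - epoch / (total_epochs - 1))


# Targets become sharper as training progresses.
target = [0.0, 1.0, 0.0, 0.0]
early = smooth_labels(target, smoothing_schedule(0, 5, initial=0.4))
late = smooth_labels(target, smoothing_schedule(4, 5, initial=0.4))
```

With a factor of 0.4 and four classes the true class receives 0.7 and every other class 0.1; by the final epoch the targets are one-hot again.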

CVOct 17, 2022
Explaining Image Classification with Visual Debates

Avinash Kori, Ben Glocker, Francesca Toni

An effective way to obtain different perspectives on any given topic is by conducting a debate, where participants argue for and against the topic. Here, we propose a novel debate framework for understanding and explaining a continuous image classifier's reasoning for making a particular prediction by modeling it as a multiplayer sequential zero-sum debate game. The contrastive nature of our framework encourages players to learn to put forward diverse arguments during the debates, picking up the reasoning trails missed by their opponents and highlighting any uncertainties in the classifier. Specifically, in our proposed setup, players propose arguments, drawn from the classifier's discretized latent knowledge, to support or oppose the classifier's decision. The resulting Visual Debates collect supporting and opposing features from the discretized latent space of the classifier, serving as explanations for the internal reasoning of the classifier towards its predictions. We demonstrate and evaluate (a practical realization of) our Visual Debates on the geometric SHAPE and MNIST datasets and on the high-resolution animal faces (AFHQ) dataset, along standard evaluation metrics for explanations (i.e. faithfulness and completeness) and novel, bespoke metrics for visual debates as explanations (consensus and split ratio).

LGMay 25, 2022
Structured Uncertainty in the Observation Space of Variational Autoencoders

James Langley, Miguel Monteiro, Charles Jones et al.

Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on the modelling of the posterior distribution over the latent space and the properties of the neural network decoder. In contrast, improving the model for the observational distribution is rarely considered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially-incoherent results with uncorrelated pixel noise, resulting in only the sample mean being somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples from the observational distribution. We propose SOS-VAE, an alternative model for the observation space, encoding spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution has the ability to capture relevant covariance between pixels, resulting in spatially-coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically-meaningful variations from the mean allowing the prediction of multiple plausible outputs with a single forward pass.
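
One common way to encode spatial dependencies with a low-rank parameterisation is a multivariate normal whose covariance is a low-rank term plus a diagonal. The sketch below samples from such a distribution; whether SOS-VAE uses exactly this low-rank-plus-diagonal form is an assumption here, and all names are illustrative.

```python
import random
from typing import List


def sample_low_rank_normal(
    mean: List[float], factor: List[List[float]], diag: List[float], rng: random.Random
) -> List[float]:
    """Draw x = mean + P z + sqrt(d) * eps, so that Cov[x] = P P^T + diag(d).

    mean: length-N list; factor: N x R matrix P as nested lists;
    diag: length-N list of non-negative per-pixel variances.
    """
    rank = len(factor[0])
    # The R shared latent draws are what make the pixels correlated.
    z = [rng.gauss(0.0, 1.0) for _ in range(rank)]
    sample = []
    for mu_i, p_i, d_i in zip(mean, factor, diag):
        correlated = sum(p * zk for p, zk in zip(p_i, z))
        independent = (d_i ** 0.5) * rng.gauss(0.0, 1.0)
        sample.append(mu_i + correlated + independent)
    return sample
```

Only N x R factor entries and N diagonal entries are stored instead of a full N x N covariance, which is what makes this tractable for images.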

LGJul 11, 2023
A Causal Ordering Prior for Unsupervised Representation Learning

Avinash Kori, Pedro Sanchez, Konstantinos Vilouras et al.

Unsupervised representation learning with variational inference relies heavily on independence assumptions over latent variables. Causal representation learning (CRL), however, argues that factors of variation in a dataset are, in fact, causally related. Allowing latent variables to be correlated, as a consequence of causal relationships, is more realistic and generalisable. So far, provably identifiable methods rely on: auxiliary information, weak labels, and interventional or even counterfactual data. Inspired by causal discovery with functional causal models, we propose a fully unsupervised representation learning method that considers a data generation process with a latent additive noise model (ANM). We encourage the latent space to follow a causal ordering via a loss function based on the Hessian of the latent distribution.

IVAug 8, 2024
Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation

Kate Čevora, Ben Glocker, Wenjia Bai

Deep learning-based medical image segmentation has seen tremendous progress over the last decade, but there is still relatively little transfer into clinical practice. One of the main barriers is the challenge of domain generalisation, which requires segmentation models to maintain high performance across a wide distribution of image data. This challenge is amplified by the many factors that contribute to the diverse appearance of medical images, such as acquisition conditions and patient characteristics. The impact of shifting patient characteristics such as age and sex on segmentation performance remains relatively under-studied, especially for abdominal organs, even though this is crucial for ensuring the fairness of the segmentation model. We perform the first study to determine the impact of population shift with respect to age and sex on abdominal CT image segmentation, by leveraging two large public datasets, and introduce a novel metric to quantify the impact. We find that population shift is a challenge similar in magnitude to cross-dataset shift for abdominal organ segmentation, and that the effect is asymmetric and dataset-dependent. We conclude that dataset diversity in terms of known patient characteristics is not necessarily equivalent to dataset diversity in terms of image features. This implies that simple population matching to ensure good generalisation and fairness may be insufficient, and we recommend that fairness research should be directed towards better understanding and quantifying medical image dataset diversity in terms of performance-relevant characteristics such as organ morphology.

CVJun 27, 2022
Distributional Gaussian Processes Layers for Out-of-Distribution Detection

Sebastian G. Popescu, David J. Sharp, James H. Cole et al.

Machine learning models deployed on medical imaging tasks must be equipped with out-of-distribution detection capabilities in order to avoid erroneous predictions. It is unclear whether out-of-distribution detection models reliant on deep neural networks are suitable for detecting domain shifts in medical imaging. Gaussian Processes can reliably separate in-distribution data points from out-of-distribution data points via their mathematical construction. Hence, we propose a parameter efficient Bayesian layer for hierarchical convolutional Gaussian Processes that incorporates Gaussian Processes operating in Wasserstein-2 space to reliably propagate uncertainty. This directly replaces convolving Gaussian Processes with a distance-preserving affine operator on distributions. Our experiments on brain tissue segmentation show that the resulting architecture approaches the performance of well-established deterministic segmentation algorithms (U-Net), which has not been achieved with previous hierarchical Gaussian Processes. Moreover, by applying the same segmentation model to out-of-distribution data (i.e., images with pathology such as brain tumors), we show that our uncertainty estimates result in out-of-distribution detection that outperforms the capabilities of previous Bayesian networks and reconstruction-based approaches that learn normative distributions. To facilitate future work, our code is publicly available.

CVJan 1, 2023
Image To Tree with Recursive Prompting

James Batten, Matthew Sinclair, Ben Glocker et al.

Extracting complex structures from grid-based data is a common key step in automated medical image analysis. The conventional solution to recovering tree-structured geometries typically involves computing the minimal cost path through intermediate representations derived from segmentation masks. However, this methodology has significant limitations in the context of projective imaging of tree-structured 3D anatomical data such as coronary arteries, since there are often overlapping branches in the 2D projection. In this work, we propose a novel approach to predicting tree connectivity structure which reformulates the task as an optimization problem over individual steps of a recursive process. We design and train a two-stage model which leverages the UNet and Transformer architectures and introduces an image-based prompting technique. Our proposed method achieves compelling results on a pair of synthetic datasets, and outperforms a shortest-path baseline.

CVSep 9, 2024
Latent 3D Brain MRI Counterfactual

Wei Peng, Tian Xia, Fabio De Sousa Ribeiro et al.

The number of samples in structural brain MRI studies is often too small to properly train deep learning models. Generative models show promise in addressing this issue by effectively learning the data distribution and generating high-fidelity MRI. However, they struggle to produce diverse, high-quality data outside the distribution defined by the training data. One way to address this issue is to use causal models developed for 3D volume counterfactuals. However, accurately modeling causality in high-dimensional spaces is challenging, so these models generally generate 3D brain MRIs of lower quality. To address these challenges, we propose a two-stage method that constructs a Structural Causal Model (SCM) within the latent space. In the first stage, we employ a VQ-VAE to learn a compact embedding of the MRI volume. Subsequently, we integrate our causal model into this latent space and execute a three-step counterfactual procedure using a closed-form Generalized Linear Model (GLM). Our experiments conducted on real-world high-resolution MRI data (1 mm) provided by the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA) demonstrate that our method can generate high-quality 3D MRI counterfactuals.
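
A three-step counterfactual procedure in a structural causal model is conventionally abduction, action, and prediction. Assuming that is the procedure meant here, the sketch below shows it in closed form for a single scalar latent with a linear additive-noise mechanism. The scalar setting, the parameter names, and the numbers are illustrative assumptions; the paper operates on VQ-VAE embeddings of 3D volumes.

```python
def counterfactual_linear(
    observed: float, parent: float, new_parent: float, weight: float, bias: float
) -> float:
    """Counterfactual for the additive model: latent = weight * parent + bias + noise."""
    # 1. Abduction: recover the exogenous noise consistent with the observation.
    noise = observed - (weight * parent + bias)
    # 2. Action: intervene by replacing the parent value with new_parent.
    # 3. Prediction: push the same noise through the model under the intervention.
    return weight * new_parent + bias + noise


# Hypothetical example: what would this latent value be if the parent were 2.0 instead of 1.0?
z_cf = counterfactual_linear(observed=3.0, parent=1.0, new_parent=2.0, weight=2.0, bias=0.5)
```

Because the mechanism is linear and additive, abduction is a single subtraction, so no iterative inference is needed.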

LGMar 13
A Causal Framework for Mitigating Data Shifts in Healthcare

Kurt Butler, Stephanie Riley, Damian Machlanski et al.

Developing predictive models that perform reliably across diverse patient populations and heterogeneous environments is a core aim of medical research. However, generalization is only possible if the learned model is robust to statistical differences between data used for training and data seen at the time and place of deployment. Domain generalization methods provide strategies to address data shifts, but each method comes with its own set of assumptions and trade-offs. To apply these methods in healthcare, we must understand how domain shifts arise, what assumptions we prefer to make, and what our design constraints are. This article proposes a causal framework for the design of predictive models to improve generalization. Causality provides a powerful language to characterize and understand diverse domain shifts, regardless of data modality. This allows us to pinpoint why models fail to generalize, leading to more principled strategies to prepare for and adapt to shifts. We recommend general mitigation strategies, discussing trade-offs and highlighting existing work. Our causality-based perspective offers a critical foundation for developing robust, interpretable, and clinically relevant AI solutions in healthcare, paving the way for reliable real-world deployment.

CVMar 22
Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis

Tian Xia, Matthew Sinclair, Andreas Schuh et al.

Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.

CVSep 16, 2024
Robust image representations with counterfactual contrastive learning

Mélanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia et al.

Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging, such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Our method, evaluated across five datasets encompassing both chest radiography and mammography data, for two established contrastive objectives (SimCLR and DINO-v2), outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance on both in-distribution and external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning reducing subgroup disparities across biological sex.

CVNov 13, 2023
Robust semi-supervised segmentation with timestep ensembling diffusion models

Margherita Rosnati, Melanie Roschewitz, Ben Glocker

Medical image segmentation is a challenging task, made more difficult by many datasets' limited size and annotations. Denoising diffusion probabilistic models (DDPM) have recently shown promise in modelling the distribution of natural images and were successfully applied to various medical imaging tasks. This work focuses on semi-supervised image segmentation using diffusion models, particularly addressing domain generalisation. Firstly, we demonstrate that smaller diffusion steps generate latent representations that are more robust for downstream tasks than larger steps. Secondly, we use this insight to propose an improved ensembling scheme that leverages information-dense small steps and the regularising effect of larger steps to generate predictions. Our model shows significantly better performance in domain-shifted settings while retaining competitive performance in-domain. Overall, this work highlights the potential of DDPMs for semi-supervised medical image segmentation and provides insights into optimising their performance under domain shift.

CVMar 11, 2024Code
Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation

Theodore Barfoot, Luis Garcia-Peraza-Herrera, Ben Glocker et al.

Deep neural networks for medical image segmentation often produce overconfident results misaligned with empirical observations. Such miscalibration challenges their clinical translation. We propose to use marginal L1 average calibration error (mL1-ACE) as a novel auxiliary loss function to improve pixel-wise calibration without compromising segmentation quality. We show that this loss, despite using hard binning, is directly differentiable, bypassing the need for approximate but differentiable surrogate or soft binning approaches. Our work also introduces the concept of dataset reliability histograms which generalises standard reliability diagrams for refined visual assessment of calibration in semantic segmentation aggregated at the dataset level. Using mL1-ACE, we reduce average and maximum calibration error by 45% and 55% respectively, maintaining a Dice score of 87% on the BraTS 2021 dataset. We share our code here: https://github.com/cai4cai/ACE-DLIRIS
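
For readers unfamiliar with the calibration metrics mentioned above, the sketch below computes average and maximum calibration error with equal-width hard bins. It is a plain evaluation metric on top-label confidences, not the marginal, differentiable mL1-ACE loss from the paper; the bin count of 10 and the function name are assumptions.

```python
from typing import Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> Tuple[float, float]:
    """Return (ACE, MCE) using equal-width hard bins over [0, 1].

    ACE averages |accuracy - mean confidence| over the non-empty bins with equal weight;
    MCE takes the largest such gap instead.
    """
    sums_conf = [0.0] * n_bins
    sums_acc = [0.0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(n_bins - 1, int(c * n_bins))  # a confidence of 1.0 falls into the last bin
        sums_conf[b] += c
        sums_acc[b] += float(ok)
        counts[b] += 1
    gaps = [
        abs(sums_acc[b] / counts[b] - sums_conf[b] / counts[b])
        for b in range(n_bins)
        if counts[b]
    ]
    if not gaps:
        return 0.0, 0.0
    return sum(gaps) / len(gaps), max(gaps)


# Two confident, correct predictions and two moderately confident, wrong ones.
ace, mce = calibration_errors([0.95, 0.95, 0.55, 0.55], [1, 1, 0, 0])
```

In this example the high-confidence bin has a gap of 0.05 and the mid-confidence bin a gap of 0.55, so the average is 0.30 and the maximum 0.55.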

CVMar 17
Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation

Marceau Lafargue-Hauret, Raghav Mehta, Fabio De Sousa Ribeiro et al.

Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High Resolution Overlay map (CHRO-map), is also introduced. Experiments show annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving ~94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.

CVJun 4, 2025Code
Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation

Theodore Barfoot, Luis C. Garcia-Peraza-Herrera, Samet Akcay et al.

Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses

CVMay 11
Counterfactual Stress Testing for Image Classification Models

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta et al.

Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.

CVSep 29, 2025Code
Segmentor-Guided Counterfactual Fine-Tuning for Locally Coherent and Targeted Image Synthesis

Tian Xia, Matthew Sinclair, Andreas Schuh et al.

Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient's age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.

CVJul 24, 2025Code
Flow Stochastic Segmentation Networks

Fabio De Sousa Ribeiro, Omar Todd, Charles Jones et al.

We introduce the Flow Stochastic Segmentation Network (Flow-SSN), a generative segmentation model family featuring discrete-time autoregressive and modern continuous-time flow variants. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, thanks to most of the model capacity being allocated to learning the base distribution of the flow, constituting an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results. Code available: https://github.com/biomedia-mira/flow-ssn.

IVJun 27, 2025Code
Cardiovascular disease classification using radiomics and geometric features from cardiac CT

Ajay Mittal, Raghav Mehta, Omar Todd et al.

Automatic detection and classification of cardiovascular disease (CVD) from Computed Tomography (CT) images play an important part in facilitating better-informed clinical decisions. However, most of the recent deep learning based methods either directly work on raw CT data or utilize it in pair with anatomical cardiac structure segmentation by training an end-to-end classifier. As such, these approaches become much more difficult to interpret from a clinical perspective. To address this challenge, in this work, we break down the CVD classification pipeline into three components: (i) image segmentation, (ii) image registration, and (iii) downstream CVD classification. Specifically, we utilize the Atlas-ISTN framework and recent segmentation foundational models to generate anatomical structure segmentation and a normative healthy atlas. These are further utilized to extract clinically interpretable radiomic features as well as deformation field based geometric features (through atlas registration) for CVD classification. Our experiments on the publicly available ASOCA dataset show that utilizing these features leads to better CVD classification accuracy (87.50%) when compared against a classification model trained directly on raw CT images (67.50%). Our code is publicly available: https://github.com/biomedia-mira/grc-net

CVSep 11, 2021Code
Class-Distribution-Aware Calibration for Long-Tailed Visual Recognition

Mobarakol Islam, Lalithkumar Seenivasan, Hongliang Ren et al.

Despite impressive accuracy, deep neural networks are often miscalibrated and tend to make overly confident predictions. Recent techniques like temperature scaling (TS) and label smoothing (LS) show effectiveness in obtaining a well-calibrated model by smoothing logits and hard labels with scalar factors, respectively. However, the use of a uniform TS or LS factor may not be optimal for calibrating models trained on a long-tailed dataset where the model produces overly confident probabilities for high-frequency classes. In this study, we propose class-distribution-aware TS (CDA-TS) and LS (CDA-LS) by incorporating class frequency information in model calibration in the context of long-tailed distribution. In CDA-TS, the scalar temperature value is replaced with the CDA temperature vector encoded with class frequency to compensate for the over-confidence. Similarly, CDA-LS uses a vector smoothing factor and flattens the hard labels according to their corresponding class distribution. We also integrate the CDA optimal temperature vector with distillation loss, which reduces miscalibration in self-distillation (SD). We empirically show that class-distribution-aware TS and LS can accommodate the imbalanced data distribution, yielding superior performance in both calibration error and predictive accuracy. We also observe that SD with an extremely imbalanced dataset is less effective in terms of calibration performance. Code is available at https://github.com/mobarakol/Class-Distribution-Aware-TS-LS.
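
The idea of replacing a scalar temperature with a per-class temperature vector can be illustrated with a minimal sketch. The abstract does not state how class frequency is encoded into the vector, so the power-law encoding below, the function names, and the `base_temperature` and `strength` parameters are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cda_temperatures(class_counts, base_temperature=1.5, strength=0.2):
    """Hypothetical encoding: frequent classes receive a larger temperature.

    Relative frequency is normalised so that a balanced dataset yields 1.0
    for every class, which reduces to ordinary scalar temperature scaling.
    """
    total = sum(class_counts)
    num_classes = len(class_counts)
    relative = [num_classes * count / total for count in class_counts]
    return [base_temperature * (r ** strength) for r in relative]


def scale_logits(logits, temperatures):
    """Divide each class logit by its own temperature."""
    return [z / t for z, t in zip(logits, temperatures)]


# Long-tailed toy example: one head class and two tail classes.
counts = [900, 90, 10]
logits = [4.0, 1.0, 0.5]
temps = cda_temperatures(counts)
probs_raw = softmax(logits)
probs_cda = softmax(scale_logits(logits, temps))
```

Unlike a scalar temperature, a per-class vector can change the ranking of classes when the spread of temperatures is large, so the strength of the encoding would need to be tuned on held-out data.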

MLJun 11, 2020Code
Deep Structural Causal Models for Tractable Counterfactual Inference

Nick Pawlowski, Daniel C. Castro, Ben Glocker

We formulate a general framework for building structural causal models (SCMs) with deep learning components. The proposed approach employs normalising flows and variational inference to enable tractable inference of exogenous noise variables - a crucial step for counterfactual inference that is missing from existing deep causal learning methods. Our framework is validated on a synthetic dataset built on MNIST as well as on a real-world medical dataset of brain MRI scans. Our experimental results indicate that we can successfully train deep SCMs that are capable of all three levels of Pearl's ladder of causation: association, intervention, and counterfactuals, giving rise to a powerful new approach for answering causal questions in imaging applications and beyond. The code for all our experiments is available at https://github.com/biomedia-mira/deepscm.
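
Why tractable inference of the exogenous noise matters can be seen in Pearl's three-step counterfactual procedure (abduction, action, prediction). The sketch below uses a hand-set affine mechanism as the simplest invertible stand-in for a learned conditional flow; the class name, its fields, and the numeric values are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class AffineMechanism:
    """Toy invertible mechanism: x = weight * parent + bias + scale * noise."""

    weight: float
    bias: float
    scale: float

    def forward(self, parent, noise):
        """Generate the child variable from its parent and exogenous noise."""
        return self.weight * parent + self.bias + self.scale * noise

    def abduct(self, parent, observed):
        """Invert the mechanism to recover the noise behind an observation."""
        return (observed - self.weight * parent - self.bias) / self.scale


def counterfactual(mechanism, parent, observed, new_parent):
    """Answer: what would `observed` have been had the parent been `new_parent`?"""
    noise = mechanism.abduct(parent, observed)   # abduction
    return mechanism.forward(new_parent, noise)  # action, then prediction


mechanism = AffineMechanism(weight=2.0, bias=1.0, scale=0.5)
observed = mechanism.forward(parent=3.0, noise=0.4)
what_if = counterfactual(mechanism, parent=3.0, observed=observed, new_parent=5.0)
```

Because the mechanism is invertible in the noise term, the abduction step is exact here; a non-invertible mechanism would require approximate inference of the noise instead.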

CVJan 24, 2020Code
VerSe: A Vertebrae Labelling and Segmentation Benchmark for Multi-detector CT Images

Anjany Sekuboyina, Malek E. Husseini, Amirhossein Bayat et al.

Vertebral labelling and segmentation are two fundamental tasks in an automated spine processing pipeline. Reliable and accurate processing of spine images is expected to benefit clinical decision-support systems for diagnosis, surgery planning, and population-based analysis on spine and bone health. However, designing automated algorithms for spine processing is challenging predominantly due to considerable variations in anatomy and acquisition protocols and due to a severe shortage of publicly available data. Addressing these limitations, the Large Scale Vertebrae Segmentation Challenge (VerSe) was organised in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) in 2019 and 2020, with a call for algorithms towards labelling and segmentation of vertebrae. Two datasets containing a total of 374 multi-detector CT scans from 355 patients were prepared and 4505 vertebrae have individually been annotated at voxel-level by a human-machine hybrid algorithm (https://osf.io/nqjyw/, https://osf.io/t98fz/). A total of 25 algorithms were benchmarked on these datasets. In this work, we present the results of this evaluation and further investigate the performance-variation at vertebra-level, scan-level, and at different fields-of-view. We also evaluate the generalisability of the approaches to an implicit domain shift in data by evaluating the top performing algorithms of one challenge iteration on data from the other iteration. The principal takeaway from VerSe: the performance of an algorithm in labelling and segmenting a spine scan hinges on its ability to correctly identify vertebrae in cases of rare anatomical variations. The content and code concerning VerSe can be accessed at: https://github.com/anjany/verse.

CVMar 5, 2019Code
FastReg: Fast Non-Rigid Registration via Accelerated Optimisation on the Manifold of Diffeomorphisms

Daniel Grzech, Loïc le Folgoc, Mattias P. Heinrich et al.

We present an implementation of a new approach to diffeomorphic non-rigid registration of medical images. The method is based on optical flow and warps images via gradient flow with the standard $L^2$ inner product. To compute the transformation, we rely on accelerated optimisation on the manifold of diffeomorphisms. We achieve regularity properties of Sobolev gradient flows, which are expensive to compute, owing to a novel method of averaging the gradients in time rather than space. We successfully register brain MRI and challenging abdominal CT scans at speeds orders of magnitude faster than previous approaches. We make our code available in a public repository: https://github.com/dgrzech/fastreg

LGSep 27, 2018Code
Morpho-MNIST: Quantitative Assessment and Diagnostics for Representation Learning

Daniel C. Castro, Jeremy Tan, Bernhard Kainz et al.

Revealing latent structure in data is an active field of research, having introduced exciting technologies such as variational autoencoders and adversarial networks, and is essential to push machine learning towards unsupervised knowledge discovery. However, a major challenge is the lack of suitable benchmarks for an objective and quantitative evaluation of learned representations. To address this issue we introduce Morpho-MNIST, a framework that aims to answer: "to what extent has my model learned to represent specific factors of variation in the data?" We extend the popular MNIST dataset by adding a morphometric analysis enabling quantitative comparison of trained models, identification of the roles of latent variables, and characterisation of sample diversity. We further propose a set of quantifiable perturbations to assess the performance of unsupervised and supervised methods on challenging tasks such as outlier detection and domain adaptation. Data and code are available at https://github.com/dccastro/Morpho-MNIST.