CV · Aug 16, 2024 · Code
Deep Generative Classification of Blood Cell Morphology
Simon Deltadahl, Julian Gilbey, Christine Van Laer et al.
Accurate classification of haematological cells is critical for diagnosing blood disorders, but presents significant challenges for machine automation owing to the complexity of cell morphology, heterogeneities of biological, pathological, and imaging characteristics, and the imbalance of cell type frequencies. We introduce CytoDiffusion, a diffusion-based classifier that effectively models blood cell morphology, combining accurate classification with robust anomaly detection, resistance to distributional shifts, interpretability, data efficiency, and superhuman uncertainty quantification. Our approach outperforms state-of-the-art discriminative models in anomaly detection (AUC 0.990 vs. 0.918), resistance to domain shifts (85.85% vs. 74.38% balanced accuracy), and performance in low-data regimes (95.88% vs. 94.95% balanced accuracy). Notably, our model generates synthetic blood cell images that are nearly indistinguishable from real images, as demonstrated by an authenticity test in which expert haematologists achieved only 52.3% accuracy (95% CI: [50.5%, 54.2%]) in distinguishing real from generated images. Furthermore, we enhance model explainability through the generation of directly interpretable counterfactual heatmaps. Our comprehensive evaluation framework, encompassing these multiple performance dimensions, establishes a new benchmark for medical image analysis in haematology, ultimately enabling improved diagnostic accuracy in clinical settings. Our code is available at https://github.com/CambridgeCIA/CytoDiffusion.
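The balanced accuracy figures quoted above matter because cell type frequencies are imbalanced. As a minimal sketch (not the authors' evaluation code), balanced accuracy can be computed as the mean of per-class recall; the class names and toy labels below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    totals = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced toy data: 8 common cells, 2 rare cells.
y_true = ["neutrophil"] * 8 + ["basophil"] * 2
# A degenerate classifier that always predicts the common class.
y_pred = ["neutrophil"] * 10

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(plain_acc, bal_acc)
```

In this toy case plain accuracy looks high while balanced accuracy exposes that the rare class is never recognised, which is why the metric suits imbalanced cell type data.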
LG · Aug 15, 2023
REFORMS: Reporting Standards for Machine Learning Based Science
Sayash Kapoor, Emily Cantrell, Kenny Peng et al. · princeton
Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways across disciplines. Motivated by this observation, our goal is to provide clear reporting standards for ML-based science. Drawing from an extensive review of past literature, we present the REFORMS checklist ($\textbf{Re}$porting Standards $\textbf{For}$ $\textbf{M}$achine Learning Based $\textbf{S}$cience). It consists of 32 questions and a paired set of guidelines. REFORMS was developed based on a consensus of 19 researchers across computer science, data science, mathematics, social sciences, and biomedical sciences. REFORMS can serve as a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility.
LG · Jun 16, 2022
Classification of datasets with imputed missing values: does imputation quality matter?
Tolou Shadbahr, Michael Roberts, Jan Stanczuk et al.
Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete, imputed, samples. The focus of the machine learning researcher is then to optimise the downstream classification performance. In this study, we highlight that it is imperative to consider the quality of the imputation. We demonstrate how the commonly used measures for assessing quality are flawed and propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data. To conclude, we highlight the compromised interpretability of classifier models trained using poorly imputed data.
LG · Oct 4, 2023
Recent Methodological Advances in Federated Learning for Healthcare
Fan Zhang, Daniel Kreuter, Yichen Chen et al.
For healthcare datasets, it is often not possible to combine data samples from multiple sites due to ethical, privacy or logistical concerns. Federated learning allows for the utilisation of powerful machine learning algorithms without requiring the pooling of data. Healthcare data has many simultaneous challenges which require new methodologies to address, such as highly-siloed data, class imbalance, missing data, distribution shifts and non-standardised variables. Federated learning adds significant methodological complexity to conventional centralised machine learning, requiring distributed optimisation, communication between nodes, aggregation of models and redistribution of models. In this systematic review, we consider all papers on Scopus that were published between January 2015 and February 2023 and which describe new federated learning methodologies for addressing challenges with healthcare data. We performed a detailed review of the 89 papers which fulfilled these criteria. Significant systemic issues were identified throughout the literature which compromise the methodologies in many of the papers reviewed. We give detailed recommendations to help improve the quality of the methodology development for federated learning in healthcare.
NA · Nov 6, 2018
A Convex Geodesic Selective Model for Image Segmentation
Michael Roberts, Ke Chen, Klaus L. Irion
Selective segmentation is an important application of image processing. In contrast to global segmentation in which all objects are segmented, selective segmentation is used to isolate specific objects in an image and is of particular interest in medical imaging -- permitting segmentation and review of a single organ. An important consideration is to minimise the amount of user input to obtain the segmentation; this differs from interactive segmentation in which more user input is allowed than selective segmentation. To achieve selection, we propose a selective segmentation model which uses the edge-weighted geodesic distance from a marker set as a penalty term. It is demonstrated that this edge-weighted geodesic penalty term improves on previous selective penalty terms. A convex formulation of the model is also presented, allowing arbitrary initialisation. It is shown that the proposed model is less parameter dependent and requires less user input than previous models. Further modifications are made to the edge-weighted geodesic distance term to ensure segmentation robustness to noise and blur. We can show that the overall Euler-Lagrange equation admits a unique viscosity solution. Numerical results show that the result is robust to user input and permits selective segmentations that are not possible with other models.
SE · Oct 21, 2022
Navigating the challenges in creating complex data systems: a development philosophy
Sören Dittmer, Michael Roberts, Julian Gilbey et al.
In this perspective, we argue that despite the democratization of powerful tools for data science and machine learning over the last decade, developing the code for a trustworthy and effective data science system (DSS) is getting harder. Perverse incentives and a lack of widespread software engineering (SE) skills are among many root causes we identify that naturally give rise to the current systemic crisis in reproducibility of DSSs. We analyze why SE and building large complex systems is, in general, hard. Based on these insights, we identify how SE addresses those difficulties and how we can apply and generalize SE methods to construct DSSs that are fit for purpose. We advocate two key development philosophies, namely that one should incrementally grow -- not biphasically plan and build -- DSSs, and one should always employ two types of feedback loops during development: one which tests the code's correctness and another that evaluates the code's efficacy.
NA · Nov 6, 2018
Multigrid Algorithm Based on Hybrid Smoothers for Variational and Selective Segmentation Models
Michael Roberts, Ke Chen, Klaus L. Irion
Automatic segmentation of an image to identify all meaningful parts is one of the most challenging as well as useful tasks in a number of application areas. This is widely studied. Selective segmentation, less studied, aims to use limited user-specified information to extract one or more interesting objects (instead of all objects). Constructing a fast solver remains a challenge for both classes of model. However, our primary concern is selective segmentation. In this work, we develop an effective multigrid algorithm, based on a new non-standard smoother to deal with non-smooth coefficients, to solve the underlying partial differential equations (PDEs) of a class of variational segmentation models in the level set formulation. For such models, non-smoothness (or jumps) is typical as segmentation is only possible if edges (jumps) are present. In comparison with previous multigrid methods which were shown to produce an acceptable {\it mean} smoothing rate for related models, the new algorithm can ensure a small and {\it global} smoothing rate that is a sufficient condition for convergence. Our rate analysis is by Local Fourier Analysis and, with it, we design the corresponding iterative solver, improving on an ineffective line smoother. Numerical tests show that the new algorithm outperforms multigrid methods based on competing smoothers.
LG · Jul 25, 2023
Reinterpreting survival analysis in the universal approximator age
Sören Dittmer, Michael Roberts, Jacobus Preller et al.
Survival analysis is an integral part of the statistical toolbox. However, while most domains of classical statistics have embraced deep learning, survival analysis only recently gained some minor attention from the deep learning community. This recent development is likely in part motivated by the COVID-19 pandemic. We aim to provide the tools needed to fully harness the potential of survival analysis in deep learning. On the one hand, we discuss how survival analysis connects to classification and regression. On the other hand, we provide technical tools. We provide a new loss function, evaluation metrics, and the first universal approximating network that provably produces survival curves without numeric integration. We show that the loss function and model outperform other approaches using a large numerical study.
LG · Jun 15, 2023
Dis-AE: Multi-domain & Multi-task Generalisation on Real-World Clinical Data
Daniel Kreuter, Samuel Tull, Julian Gilbey et al.
Clinical data is often affected by clinically irrelevant factors such as discrepancies between measurement devices or differing processing methods between sites. In the field of machine learning (ML), these factors are known as domains and the distribution differences they cause in the data are known as domain shifts. ML models trained using data from one domain often perform poorly when applied to data from another domain, potentially leading to wrong predictions. As such, developing machine learning models that can generalise well across multiple domains is a challenging yet essential task in the successful application of ML in clinical practice. In this paper, we propose a novel disentangled autoencoder (Dis-AE) neural network architecture that can learn domain-invariant data representations for multi-label classification of medical measurements even when the data is influenced by multiple interacting domain shifts at once. The model utilises adversarial training to produce data representations from which the domain can no longer be determined. We evaluate the model's domain generalisation capabilities on synthetic datasets and full blood count (FBC) data from blood donors as well as primary and secondary care patients, showing that Dis-AE improves model generalisation on multiple domains simultaneously while preserving clinically relevant information.
49.0 · RO · Mar 20
Why Cognitive Robotics Matters: Lessons from OntoAgent and LLM Deployment in HARMONIC for Safety-Critical Robot Teaming
Sanjay Oruganti, Sergei Nirenburg, Marjorie McShane et al.
Deploying embodied AI agents in the physical world demands cognitive capabilities for long-horizon planning that execute reliably, deterministically, and transparently. We present HARMONIC, a cognitive-robotic architecture that pairs OntoAgent, a content-centric cognitive architecture providing metacognitive self-monitoring, domain-grounded diagnosis, and consequence-based action selection over ontologically structured knowledge, with a modular reactive tactical layer. HARMONIC's modular design enables a functional evaluation of whether LLMs can replicate OntoAgent's cognitive capabilities, evaluated within the same robotic system under identical conditions. Six LLMs spanning frontier and efficient tiers replace OntoAgent in a collaborative maintenance scenario under native and knowledge-equalized conditions. Results reveal that LLMs do not consistently assess their own knowledge state before acting, causing downstream failures in diagnostic reasoning and action selection. These deficits persist even with equivalent procedural knowledge, indicating the issues are architectural rather than knowledge-based. These findings support the design of physically embodied systems in which cognitive architectures retain primary authority for reasoning, owing to their deterministic and transparent characteristics.
17.6 · LG · May 20
Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction
Fan Zhang, Simon Deltadahl, Majid Lotfian Delouee et al.
Recent reviews find that the vast majority of published healthcare federated learning (FL) studies never reach real-world deployment. We developed an embedding-based FL pipeline for iron deficiency prediction from routine full blood count (FBC) data and deployed it across real institutional environments at Amsterdam University Medical Centre (AUMC) and NHS Blood and Transplant (NHSBT), two clinical environments that differ markedly in iron deficiency prevalence, ferritin distribution, and subject populations. A frozen domain-specific haematology foundation model, DeepCBC, performs site-local representation extraction, restricting federated training to a compact downstream classifier and substantially reducing recurrent communication relative to full-encoder federation. The two clinical datasets are structurally not independent and identically distributed (non-IID), with heterogeneity arising from distinct population differences rather than sampling artefacts. Runtime governance is enforced by FLA$^3$, a healthcare-oriented FL platform providing study-scoped execution, policy-based authorisation, and signed audit logging. Standard sample-size-weighted aggregation (FedAvg) reduced the area under the receiver operating characteristic curve (ROC-AUC) at both sites relative to local-only training, as the global update was biased towards the larger AUMC distribution. FedMAP, a personalised aggregation method, raised ROC-AUC from 0.9470 to 0.9594 at AUMC and from 0.8558 to 0.8671 at NHSBT relative to local-only training, achieving the highest macro ROC-AUC of 0.9133 and the best macro balanced accuracy overall. These results support personalised aggregation in clinical federations where client sample size and task relevance diverge substantially.
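To illustrate why sample-size-weighted aggregation can pull the global model towards the larger site, here is a minimal sketch of a FedAvg-style weighted average over client parameter vectors. It is not the pipeline described in the abstract: the function name, the two-parameter vectors, and the client sizes are hypothetical, and FedMAP is not reproduced here.

```python
def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of per-client parameter vectors (FedAvg-style)."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical two-site federation with very different sample counts.
large_site = [1.0, 0.0]  # parameters learned locally at the larger site
small_site = [0.0, 1.0]  # parameters learned locally at the smaller site
global_params = fedavg([large_site, small_site], [9000, 1000])
print(global_params)
```

With these assumed sizes the aggregated parameters sit much closer to the larger site's local solution, which matches the bias towards the larger distribution that the abstract reports for standard FedAvg.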
LG · Nov 18, 2022
Estimating defection in subscription-type markets: empirical analysis from the scholarly publishing industry
Michael Roberts, J. Ignacio Deza, Hisham Ihshaish et al.
We present the first empirical study on customer churn prediction in the scholarly publishing industry. The study examines our proposed prediction method on customer subscription data covering a period of 6.5 years, provided by a major academic publisher. We explore the subscription-type market within the context of customer defection and modelling, and analyse the business model of such markets and how it characterises the academic publishing business. The proposed method infers a customer's likelihood of defection on the basis of their re-sampled use of provider resources, in this context the volume and frequency of content downloads. We show that this approach can be both accurate and uniquely useful in the business-to-business context, with which the scholarly publishing business model shares similarities. The main findings suggest that whilst all predictive models examined, especially ensemble machine learning methods, predict churn with substantial accuracy nearly a year ahead, this can be achieved even when the specific behavioural attributes associated with each customer's probability of churn are overlooked, allowing highly accurate inference of churn from minimal data. We show that modelling churn on the basis of re-sampling customers' use of resources over subscription time is a better (simplified) approach than considering the high granularity that can often characterise consumption behaviour.
LG · Dec 19, 2023
The curious case of the test set AUROC
Michael Roberts, Alon Hazan, Sören Dittmer et al.
Whilst the size and complexity of ML models have rapidly and significantly increased over the past decade, the methods for assessing their performance have not kept pace. In particular, among the many potential performance metrics, the ML community stubbornly continues to use (a) the area under the receiver operating characteristic curve (AUROC) for a validation and test cohort (distinct from training data) or (b) the sensitivity and specificity for the test data at an optimal threshold determined from the validation ROC. However, we argue that considering scores derived from the test ROC curve alone gives only a narrow insight into how a model performs and its ability to generalise.
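The two reporting practices named above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: AUROC is computed as the fraction of correctly ranked positive/negative pairs, and the "optimal" validation threshold is assumed here to mean the one maximising Youden's J (sensitivity + specificity - 1), which the abstract does not specify. All scores and labels are hypothetical toy values.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(scores, labels):
    """Assumed notion of 'optimal': lowest threshold maximising sensitivity + specificity."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: sum(sens_spec(scores, labels, t)))


# Hypothetical validation and test cohorts (both distinct from training data).
val_scores, val_labels = [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]
test_scores, test_labels = [0.2, 0.6, 0.3, 0.9], [0, 0, 1, 1]

threshold = youden_threshold(val_scores, val_labels)  # chosen on validation only
test_auroc = auroc(test_scores, test_labels)          # practice (a)
test_sens, test_spec = sens_spec(test_scores, test_labels, threshold)  # practice (b)
print(threshold, test_auroc, test_sens, test_spec)
```

Even in this toy setting, a threshold that looks optimal on validation gives a different operating point on the test cohort, a behaviour that a single test AUROC value does not show.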
IV · Oct 31, 2024
Parameter choices in HaarPSI for IQA with medical images
Clemens Karner, Janek Gröhl, Ian Selby et al.
When developing machine learning models, image quality assessment (IQA) measures are a crucial component for the evaluation of obtained output images. However, commonly used full-reference IQA (FR-IQA) measures have been primarily developed and optimized for natural images. In many specialized settings, such as medical images, this poses an often overlooked problem regarding suitability. In previous studies, the FR-IQA measure HaarPSI showed promising behavior regarding generalizability. The measure is based on Haar wavelet representations and the framework allows optimization of two parameters. So far, these parameters have been aligned for natural images. Here, we optimize these parameters for two medical image data sets, a photoacoustic and a chest X-ray data set, with IQA expert ratings. We observe that they lead to similar parameter values, different to the natural image data, and are more sensitive to parameter changes. We denote the novel optimized setting as HaarPSI$_{MED}$, which improves the performance of the employed medical images significantly (p<0.05). Additionally, we include an independent CT test data set that illustrates the generalizability of HaarPSI$_{MED}$, as well as visual examples that qualitatively demonstrate the improvement. The results suggest that adapting common IQA measures within their frameworks for medical images can provide a valuable, generalizable addition to employment of more specific task-based measures.
LG · Nov 17, 2025
Laplace Learning in Wasserstein Space
Mary Chriselda Antony Oliver, Michael Roberts, Carola-Bibiane Schönlieb et al.
The manifold hypothesis posits that high-dimensional data typically resides on low-dimensional subspaces. In this paper, we assume the manifold hypothesis to investigate graph-based semi-supervised learning methods. In particular, we examine Laplace Learning in the Wasserstein space, extending the classical notion of graph-based semi-supervised learning algorithms from finite-dimensional Euclidean spaces to an infinite-dimensional setting. To achieve this, we prove variational convergence of a discrete graph p-Dirichlet energy to its continuum counterpart. In addition, we characterize the Laplace-Beltrami operator on a submanifold of the Wasserstein space. Finally, we validate the proposed theoretical framework through numerical experiments conducted on benchmark datasets, demonstrating the consistency of our classification performance in high-dimensional settings.
ML · Apr 7, 2025
SurvSurf: a partially monotonic neural network for first-hitting time prediction of intermittently observed discrete and continuous sequential events
Yichen Kelly Chen, Sören Dittmer, Kinga Bernatowicz et al.
We propose a neural-network based survival model (SurvSurf) specifically designed for direct and simultaneous probabilistic prediction of the first hitting time of sequential events from baseline. Unlike existing models, SurvSurf is theoretically guaranteed to never violate the monotonic relationship between the cumulative incidence functions of sequential events, while allowing nonlinear influence from predictors. It also incorporates implicit truths for unobserved intermediate events in model fitting, and supports both discrete and continuous time and events. We also identified a variant of the Integrated Brier Score (IBS) that showed robust correlation with the mean squared error (MSE) between the true and predicted probabilities by accounting for implied truths about the missing intermediate events. We demonstrated the superiority of SurvSurf compared to modern and traditional predictive survival models in two simulated datasets and two real-world datasets, using MSE, the more robust IBS and by measuring the extent of monotonicity violation.
IV · Jan 24, 2025
Review and Recommendations for using Artificial Intelligence in Intracoronary Optical Coherence Tomography Analysis
Xu Chen, Yuan Huang, Benn Jessney et al.
Artificial intelligence (AI) methodologies hold great promise for the rapid and accurate diagnosis of coronary artery disease (CAD) from intravascular optical coherence tomography (IVOCT) images. Numerous papers have been published describing AI-based models for different diagnostic tasks, yet it remains unclear which models have potential clinical utility and have been properly validated. This systematic review considered literature published between January 2015 and February 2023 describing AI-based diagnosis of CAD using IVOCT. Our search identified 5,576 studies, with 513 included after initial screening and 35 studies included in the final systematic review after quality screening. Our findings indicate that most of the identified models are not currently suitable for clinical use, primarily due to methodological flaws and underlying biases. To address these issues, we provide recommendations to improve model quality and research practices to enhance the development of clinically useful AI products.
AI · Jan 17, 2022
Data Harmonisation for Information Fusion in Digital Healthcare: A State-of-the-Art Systematic Review, Meta-Analysis and Future Research Directions
Yang Nan, Javier Del Ser, Simon Walsh et al.
Removing the bias and variance of multicentre data has always been a challenge in large scale digital healthcare studies, which requires the ability to integrate clinical features extracted from data acquired by different scanners and protocols to improve stability and robustness. Previous studies have described various computational approaches to fuse single modality multicentre datasets. However, these surveys rarely focused on evaluation metrics and lacked a checklist for computational data harmonisation studies. In this systematic review, we summarise the computational data harmonisation approaches for multi-modality data in the digital healthcare field, including harmonisation strategies and evaluation metrics based on different theories. In addition, a comprehensive checklist that summarises common practices for data harmonisation studies is proposed to guide researchers to report their research findings more effectively. Last but not least, flowcharts presenting possible ways for methodology and metric selection are proposed and the limitations of different methods have been surveyed for future research.
AI · Nov 18, 2021
Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence
Xiang Bai, Hanchen Wang, Liya Ma et al.
Artificial intelligence (AI) offers a promising way to streamline COVID-19 diagnoses. However, concerns surrounding security and trustworthiness impede the collection of large-scale representative medical data, posing a considerable challenge for training a well-generalised model in clinical practice. To address this, we launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be trained in a distributed manner and independently executed at each host institution under a federated learning (FL) framework without data sharing. Here we show that our FL model outperformed all the local models by a large margin (test sensitivity/specificity in China: 0.973/0.951, in the UK: 0.730/0.942), achieving comparable performance with a panel of professional radiologists. We further evaluated the model on hold-out data (collected from another two hospitals that were left out of the FL) and heterogeneous data (acquired with contrast materials), provided visual explanations for decisions made by the model, and analysed the trade-offs between the model performance and the communication costs in the federated training process. Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK. Collectively, our work advanced the prospects of utilising federated learning for privacy-preserving AI in digital health.
LG · Aug 14, 2020
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Michael Roberts, Derek Driggs, Matthew Thorpe et al.
Machine learning methods offer great promise for fast and accurate detection and prognostication of COVID-19 from standard-of-care chest radiographs (CXR) and computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we search EMBASE via OVID, MEDLINE via PubMed, bioRxiv, medRxiv and arXiv for published papers and preprints uploaded from January 1, 2020 to October 3, 2020 which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 61 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher quality model development and well documented manuscripts.
ME · Apr 19, 2020
Sequential hypothesis testing in machine learning, and crude oil price jump size detection
Michael Roberts, Indranil SenGupta
In this paper we present a sequential hypothesis test for the detection of a general jump size distribution. Infinitesimal generators for the corresponding log-likelihood ratios are presented and analyzed. Bounds for infinitesimal generators in terms of super-solutions and sub-solutions are computed. This is shown to be implementable in relation to various classification problems for a crude oil price data set. Machine and deep learning algorithms are implemented to extract a specific deterministic component from the crude oil data set, and the deterministic component is implemented to improve the Barndorff-Nielsen and Shephard model, a commonly used stochastic model for derivative and commodity market analysis.
CV · Nov 21, 2018
Chan-Vese Reformulation for Selective Image Segmentation
Michael Roberts, Jack Spencer
Selective segmentation involves incorporating user input to partition an image into foreground and background, by discriminating between objects of a similar type. Typically, such methods involve introducing additional constraints to generic segmentation approaches. However, we show that this is often inconsistent with respect to common assumptions about the image. The proposed method introduces a new fitting term that is more useful in practice than the Chan-Vese framework. In particular, the idea is to define a term that allows for the background to consist of multiple regions of inhomogeneity. We provide experimental results comparing the proposed method with alternative approaches to demonstrate its advantages, broadening the possible application of these methods.