LGFeb 13
Extending confidence calibration to generalised measures of variationAndrew Thompson, Vivek Desai
We propose the Variation Calibration Error (VCE) metric for assessing the calibration of machine learning classifiers. The metric can be viewed as an extension of the well-known Expected Calibration Error (ECE) which assesses the calibration of the maximum probability or confidence. Other ways of measuring the variation of a probability distribution exist which have the advantage of taking into account the full probability distribution, for example the Shannon entropy. We show how the ECE approach can be extended from assessing confidence calibration to assessing the calibration of any metric of variation. We present numerical examples upon synthetic predictions which are perfectly calibrated by design, demonstrating that, in this scenario, the VCE has the desired property of approaching zero as the number of data samples increases, in contrast to another entropy-based calibration metric (the UCE) which has been proposed in the literature.
LGJan 29
Investigation into using stochastic embedding representations for evaluating the trustworthiness of the Fréchet Inception DistanceCiaran Bench, Vivek Desai, Carlijn Roozemond et al.
Feature embeddings acquired from pretrained models are widely used in medical applications of deep learning to assess the characteristics of datasets; e.g. to determine the quality of synthetic, generated medical images. The Fréchet Inception Distance (FID) is one popular synthetic image quality metric that relies on the assumption that the characteristic features of the data can be detected and encoded by an InceptionV3 model pretrained on ImageNet1K (natural images). While it is widely known that this makes it less effective for applications involving medical images, the extent to which the metric fails to capture meaningful differences in image characteristics is not obviously known. Here, we use Monte Carlo dropout to compute the predictive variance in the FID as well as a supplemental estimate of the predictive variance in the feature embedding model's latent representations. We show that the magnitudes of the predictive variances considered exhibit varying degrees of correlation with the extent to which test inputs (ImageNet1K validation set augmented at various strengths, and other external datasets) are out-of-distribution relative to its training data, providing some insight into the effectiveness of their use as indicators of the trustworthiness of the FID.
LGOct 31, 2025
A systematic evaluation of uncertainty quantification techniques in deep learning: a case study in photoplethysmography signal analysisCiaran Bench, Oskar Pfeffer, Vivek Desai et al.
In principle, deep learning models trained on medical time-series, including wearable photoplethysmography (PPG) sensor data, can provide a means to continuously monitor physiological parameters outside of clinical settings. However, there is considerable risk of poor performance when deployed in practical measurement scenarios leading to negative patient outcomes. Reliable uncertainties accompanying predictions can provide guidance to clinicians in their interpretation of the trustworthiness of model outputs. It is therefore of interest to compare the effectiveness of different approaches. Here we implement an unprecedented set of eight uncertainty quantification (UQ) techniques to models trained on two clinically relevant prediction tasks: Atrial Fibrillation (AF) detection (classification), and two variants of blood pressure regression. We formulate a comprehensive evaluation procedure to enable a rigorous comparison of these approaches. We observe a complex picture of uncertainty reliability across the different techniques, where the most optimal for a given task depends on the chosen expression of uncertainty, evaluation metric, and scale of reliability assessed. We find that assessing local calibration and adaptivity provides practically relevant insights about model behaviour that otherwise cannot be acquired using more commonly implemented global reliability metrics. We emphasise that criteria for evaluating UQ techniques should cater to the model's practical use case, where the use of a small number of measurements per patient places a premium on achieving small-scale reliability for the chosen expression of uncertainty, while preserving as much predictive performance as possible.