CVJun 3, 2022
Metrics reloaded: Recommendations for image analysis validationLena Maier-Hein, Annika Reinke, Patrick Godau et al. · utoronto
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
CVDec 16, 2022
Biomedical image analysis competitions: The state of current participation practiceMatthias Eisenmann, Annika Reinke, Vivienn Weru et al. · utoronto
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
CVFeb 3, 2023
Understanding metric-related pitfalls in image analysis validationAnnika Reinke, Minu D. Tizabi, Michael Baumgartner et al.
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
IVApr 1, 2022
Personalized Prediction of Future Lesion Activity and Treatment Effect in Multiple Sclerosis from Baseline MRIJoshua Durso-Finley, Jean-Pierre R. Falet, Brennan Nichyporuk et al.
Precision medicine for chronic diseases such as multiple sclerosis (MS) involves choosing a treatment which best balances efficacy and side effects/preferences for individual patients. Making this choice as early as possible is important, as delays in finding an effective therapy can lead to irreversible disability accrual. To this end, we present the first deep neural network model for individualized treatment decisions from baseline magnetic resonance imaging (MRI) (with clinical information if available) for MS patients. Our model (a) predicts future new and enlarging T2 weighted (NE-T2) lesion counts on follow-up MRI on multiple treatments and (b) estimates the conditional average treatment effect (CATE), as defined by the predicted future suppression of NE-T2 lesions, between different treatment options relative to placebo. Our model is validated on a proprietary federated dataset of 1817 multi-sequence MRIs acquired from MS patients during four multi-centre randomized clinical trials. Our framework achieves high average precision in the binarized regression of future NE-T2 lesions on five different treatments, identifies heterogeneous treatment effects, and provides a personalized treatment recommendation that accounts for treatment-associated risk (e.g. side effects, patient preference, administration difficulties).
CVOct 31, 2022
Rethinking Generalization: The Impact of Annotation Style on Medical Image SegmentationBrennan Nichyporuk, Jillian Cardinell, Justin Szeto et al.
Generalization is an important attribute of machine learning models, particularly for those that are to be deployed in a medical context, where unreliable predictions can have real world consequences. While the failure of models to generalize across datasets is typically attributed to a mismatch in the data distributions, performance gaps are often a consequence of biases in the 'ground-truth' label annotations. This is particularly important in the context of medical image segmentation of pathological structures (e.g. lesions), where the annotation process is much more subjective, and affected by a number underlying factors, including the annotation protocol, rater education/experience, and clinical aims, among others. In this paper, we show that modeling annotation biases, rather than ignoring them, poses a promising way of accounting for differences in annotation style across datasets. To this end, we propose a generalized conditioning framework to (1) learn and account for different annotation styles across multiple datasets using a single model, (2) identify similar annotation styles across different datasets in order to permit their effective aggregation, and (3) fine-tune a fully trained model to a new annotation style with just a few samples. Next, we present an image-conditioning approach to model annotation styles that correlate with specific image features, potentially enabling detection biases to be more easily identified.
CVAug 3, 2022
Counterfactual Image Synthesis for Discovery of Personalized Predictive Image MarkersAmar Kumar, Anjun Hu, Brennan Nichyporuk et al.
The discovery of patient-specific imaging markers that are predictive of future disease outcomes can help us better understand individual-level heterogeneity of disease evolution. In fact, deep learning models that can provide data-driven personalized markers are much more likely to be adopted in medical practice. In this work, we demonstrate that data-driven biomarker discovery can be achieved through a counterfactual synthesis process. We show how a deep conditional generative model can be used to perturb local imaging features in baseline images that are pertinent to subject-specific future disease evolution and result in a counterfactual image that is expected to have a different future outcome. Candidate biomarkers, therefore, result from examining the set of features that are perturbed in this process. Through several experiments on a large-scale, multi-scanner, multi-center multiple sclerosis (MS) clinical trial magnetic resonance imaging (MRI) dataset of relapsing-remitting (RRMS) patients, we demonstrate that our model produces counterfactuals with changes in imaging features that reflect established clinical markers predictive of future MRI lesional activity at the population level. Additional qualitative results illustrate that our model has the potential to discover novel and subject-specific predictive markers of future activity.
CVAug 21, 2023
Debiasing Counterfactuals In the Presence of Spurious CorrelationsAmar Kumar, Nima Fathi, Raghav Mehta et al.
Deep learning models can perform well in complex medical imaging classification tasks, even when basing their conclusions on spurious correlations (i.e. confounders), should they be prevalent in the training dataset, rather than on the causal image markers of interest. This would thereby limit their ability to generalize across the population. Explainability based on counterfactual image generation can be used to expose the confounders but does not provide a strategy to mitigate the bias. In this work, we introduce the first end-to-end training framework that integrates both (i) popular debiasing classifiers (e.g. distributionally robust optimization (DRO)) to avoid latching onto the spurious correlations and (ii) counterfactual image generation to unveil generalizable imaging markers of relevance to the task. Additionally, we propose a novel metric, Spurious Correlation Latching Score (SCLS), to quantify the extent of the classifier reliance on the spurious correlation as exposed by the counterfactual images. Through comprehensive experiments on two public datasets (with the simulated and real visual artifacts), we demonstrate that the debiasing method: (i) learns generalizable markers across the population, and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
LGOct 24, 2023
Improving Robustness and Reliability in Medical Image Classification with Latent-Guided Diffusion and Nested-EnsemblesXing Shen, Hengguan Huang, Brennan Nichyporuk et al.
Once deployed, medical image analysis methods are often faced with unexpected image corruptions and noise perturbations. These unknown covariate shifts present significant challenges to deep learning based methods trained on "clean" images. This often results in unreliable predictions and poorly calibrated confidence, hence hindering clinical applicability. While recent methods have been developed to address specific issues such as confidence calibration or adversarial robustness, no single framework effectively tackles all these challenges simultaneously. To bridge this gap, we propose LaDiNE, a novel ensemble learning method combining the robustness of Vision Transformers with diffusion-based generative models for improved reliability in medical image classification. Specifically, transformer encoder blocks are used as hierarchical feature extractors that learn invariant features from images for each ensemble member, resulting in features that are robust to input perturbations. In addition, diffusion models are used as flexible density estimators to estimate member densities conditioned on the invariant features, leading to improved modeling of complex data distributions while retaining properly calibrated confidence. Extensive experiments on tuberculosis chest X-rays and melanoma skin cancer datasets demonstrate that LaDiNE achieves superior performance compared to a wide range of state-of-the-art methods by simultaneously improving prediction accuracy and confidence calibration under unseen noise, adversarial perturbations, and resolution degradation.
CVAug 1, 2022
Information Gain Sampling for Active Learning in Medical Image ClassificationRaghav Mehta, Changjian Shui, Brennan Nichyporuk et al.
Large, annotated datasets are not widely available in medical image analysis due to the prohibitive time, costs, and challenges associated with labelling large datasets. Unlabelled datasets are easier to obtain, and in many contexts, it would be feasible for an expert to provide labels for a small subset of images. This work presents an information-theoretic active learning framework that guides the optimal selection of images from the unlabelled pool to be labeled based on maximizing the expected information gain (EIG) on an evaluation dataset. Experiments are performed on two different medical image classification datasets: multi-class diabetic retinopathy disease scale classification and multi-class skin lesion classification. Results indicate that by adapting EIG to account for class-imbalances, our proposed Adapted Expected Information Gain (AEIG) outperforms several popular baselines including the diversity based CoreSet and uncertainty based maximum entropy sampling. Specifically, AEIG achieves ~95% of overall performance with only 19% of the training data, while other active learning approaches require around 25%. We show that, by careful design choices, our model can be integrated into existing deep learning classifiers.
CVMar 20, 2024Code
HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive ModelingDaniel Duenias, Brennan Nichyporuk, Tal Arbel et al.
The integration of diverse clinical modalities such as medical imaging and the tabular data extracted from patients' Electronic Health Records (EHRs) is a crucial aspect of modern healthcare. Integrative analysis of multiple sources can provide a comprehensive understanding of the clinical condition of a patient, improving diagnosis and treatment decision. Deep Neural Networks (DNNs) consistently demonstrate outstanding performance in a wide range of multimodal tasks in the medical domain. However, the complex endeavor of effectively merging medical imaging with clinical, demographic and genetic information represented as numerical tabular data remains a highly active and ongoing research pursuit. We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR's values and measurements. This approach aims to leverage the complementary information present in these modalities to enhance the accuracy of various medical applications. We demonstrate the strength and generality of our method on two different brain Magnetic Resonance Imaging (MRI) analysis tasks, namely, brain age prediction conditioned by subject's sex and multi-class Alzheimer's Disease (AD) classification conditioned by tabular data. We show that our framework outperforms both single-modality models and state-of-the-art MRI tabular data fusion methods. A link to our code can be found at https://github.com/daniel4725/HyperFusion
CVMay 15, 2024Code
DeCoDEx: Confounder Detector Guidance for Improved Diffusion-based Counterfactual ExplanationsNima Fathi, Amar Kumar, Brennan Nichyporuk et al.
Deep learning classifiers are prone to latching onto dominant confounders present in a dataset rather than on the causal markers associated with the target class, leading to poor generalization and biased predictions. Although explainability via counterfactual image generation has been successful at exposing the problem, bias mitigation strategies that permit accurate explainability in the presence of dominant and diverse artifacts remain unsolved. In this work, we propose the DeCoDEx framework and show how an external, pre-trained binary artifact detector can be leveraged during inference to guide a diffusion-based counterfactual image generator towards accurate explainability. Experiments on the CheXpert dataset, using both synthetic artifacts and real visual artifacts (support devices), show that the proposed method successfully synthesizes the counterfactual images that change the causal pathology markers associated with Pleural Effusion while preserving or ignoring the visual artifacts. Augmentation of ERM and Group-DRO classifiers with the DeCoDEx generated images substantially improves the results across underrepresented groups that are out of distribution for each class. The code is made publicly available at https://github.com/NimaFathi/DeCoDEx.
CVSep 12, 2025Code
Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI DiagnosesEmily Kaczmarek, Justin Szeto, Brennan Nichyporuk et al.
3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly acquired in clinical settings to monitor a wide range of neurological conditions, including neurodegenerative disorders and stroke. While deep learning models have shown promising results analyzing 3D MRI across a number of brain imaging tasks, most are highly tailored for specific tasks with limited labeled data, and are not able to generalize across tasks and/or populations. The development of self-supervised learning (SSL) has enabled the creation of large medical foundation models that leverage diverse, unlabeled datasets ranging from healthy to diseased data, showing significant success in 2D medical imaging applications. However, even the very few foundation models for 3D brain MRI that have been developed remain limited in resolution, scope, or accessibility. In this work, we present a general, high-resolution SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on 18,759 patients (44,958 scans) from 11 publicly available datasets spanning diverse neurological diseases. We compare our model to Masked Autoencoders (MAE), as well as two supervised baselines, on four diverse downstream prediction tasks in both in-distribution and out-of-distribution settings. Our fine-tuned SimCLR model outperforms all other models across all tasks. Notably, our model still achieves superior performance when fine-tuned using only 20% of labeled training samples for predicting Alzheimer's disease. We use publicly available code and data, and release our trained model at https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly applicable and accessible foundation model for clinical brain MRI analysis.
CVSep 12, 2025Code
SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and DatasetsEmily Kaczmarek, Justin Szeto, Brennan Nichyporuk et al.
Alzheimer's disease is a progressive, neurodegenerative disorder that causes memory loss and cognitive decline. While there has been extensive research in applying deep learning models to Alzheimer's prediction tasks, these models remain limited by lack of available labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. In this study, we adapt three state-of-the-art temporal self-supervised learning (SSL) approaches for 3D brain MRI analysis, and add novel extensions designed to handle variable-length inputs and learn robust spatial features. We aggregate four publicly available datasets comprising 3,161 patients for pre-training, and show the performance of our model across multiple Alzheimer's prediction tasks including diagnosis classification, conversion detection, and future conversion prediction. Importantly, our SSL model implemented with temporal order prediction and contrastive learning outperforms supervised learning on six out of seven downstream tasks. It demonstrates adaptability and generalizability across tasks and number of input images with varying time intervals, highlighting its capacity for robust performance across clinical applications. We release our code and model publicly at https://github.com/emilykaczmarek/SSL-AD.
IVMar 30, 2021Code
HAD-Net: A Hierarchical Adversarial Knowledge Distillation Network for Improved Enhanced Tumour Segmentation Without Post-Contrast ImagesSaverio Vadacchino, Raghav Mehta, Nazanin Mohammadi Sepahvand et al.
Segmentation of enhancing tumours or lesions from MRI is important for detecting new disease activity in many clinical contexts. However, accurate segmentation requires the inclusion of medical images (e.g., T1 post contrast MRI) acquired after injecting patients with a contrast agent (e.g., Gadolinium), a process no longer thought to be safe. Although a number of modality-agnostic segmentation networks have been developed over the past few years, they have been met with limited success in the context of enhancing pathology segmentation. In this work, we present HAD-Net, a novel offline adversarial knowledge distillation (KD) technique, whereby a pre-trained teacher segmentation network, with access to all MRI sequences, teaches a student network, via hierarchical adversarial training, to better overcome the large domain shift presented when crucial images are absent during inference. In particular, we apply HAD-Net to the challenging task of enhancing tumour segmentation when access to post-contrast imaging is not available. The proposed network is trained and tested on the BraTS 2019 brain tumour segmentation challenge dataset, where it achieves performance improvements in the ranges of 16% - 26% over (a) recent modality-agnostic segmentation methods (U-HeMIS, U-HVED), (b) KD-Net adapted to this problem, (c) the pre-trained student network and (d) a non-hierarchical version of the network (AD-Net), in terms of Dice scores for enhancing tumour (ET). The network also shows improvements in tumour core (TC) Dice scores. Finally, the network outperforms both the baseline student network and AD-Net in terms of uncertainty quantification for enhancing tumour segmentation based on the BraTs 2019 uncertainty challenge metrics. Our code is publicly available at: https://github.com/SaverioVad/HAD_Net
CVFeb 6, 2025
Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for FreeGian Mario Favero, Parham Saremi, Emily Kaczmarek et al.
Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme shown to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC Melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve competitive performance against SOTA discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable, and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical, clinical contexts. Further information is available on our project page: https://faverogian.github.io/med-diffusion-classifier.github.io/.
IVSep 7, 2025
Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language GuidanceMohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold et al.
Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the success of these models is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained models do not exist for 3D, significantly limiting progress. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language remains unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression, and enhanced medical training by visualizing hypothetical conditions in realistic detail. Our work takes a step toward this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this is the first demonstration of a language-guided native-3D diffusion model applied to neurological imaging, where faithful three-dimensional modeling is essential. On two neurological MRI datasets, our framework simulates varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity. Our results lay the groundwork for prompt-driven disease progression analysis in 3D medical imaging. Project link - https://lesupermomo.github.io/imagining-alternatives/.
IVAug 9, 2025
Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on TreatmentsGian Mario Favero, Ge Ya Luo, Nima Fathi et al.
Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.
IVAug 2, 2021
Cohort Bias Adaptation in Aggregated Datasets for Lesion SegmentationBrennan Nichyporuk, Jillian Cardinell, Justin Szeto et al.
Many automatic machine learning models developed for focal pathology (e.g. lesions, tumours) detection and segmentation perform well, but do not generalize as well to new patient cohorts, impeding their widespread adoption into real clinical contexts. One strategy to create a more diverse, generalizable training set is to naively pool datasets from different cohorts. Surprisingly, training on this \it{big data} does not necessarily increase, and may even reduce, overall performance and model generalizability, due to the existence of cohort biases that affect label distributions. In this paper, we propose a generalized affine conditioning framework to learn and account for cohort biases across multi-source datasets, which we call Source-Conditioned Instance Normalization (SCIN). Through extensive experimentation on three different, large scale, multi-scanner, multi-centre Multiple Sclerosis (MS) clinical trial MRI datasets, we show that our cohort bias adaptation method (1) improves performance of the network on pooled datasets relative to naively pooling datasets and (2) can quickly adapt to a new cohort by fine-tuning the instance normalization parameters, thus learning the new cohort bias with only 10 labelled samples.
IVJul 27, 2021
Optimizing Operating Points for High Performance Lesion Detection and Segmentation Using Lesion Size ReweightingBrennan Nichyporuk, Justin Szeto, Douglas L. Arnold et al.
There are many clinical contexts which require accurate detection and segmentation of all focal pathologies (e.g. lesions, tumours) in patient images. In cases where there are a mix of small and large lesions, standard binary cross entropy loss will result in better segmentation of large lesions at the expense of missing small ones. Adjusting the operating point to accurately detect all lesions generally leads to oversegmentation of large lesions. In this work, we propose a novel reweighing strategy to eliminate this performance gap, increasing small pathology detection performance while maintaining segmentation accuracy. We show that our reweighing strategy vastly outperforms competing strategies based on experiments on a large scale, multi-scanner, multi-center dataset of Multiple Sclerosis patient images.
IVApr 12, 2021
Common Limitations of Image Processing Metrics: A Picture StoryAnnika Reinke, Minu D. Tizabi, Carole H. Sudre et al.
While the importance of automatic image analysis is continuously increasing, recent meta-research revealed major flaws with respect to algorithm validation. Performance metrics are particularly key for meaningful, objective, and transparent performance assessment and validation of the used automatic algorithms, but relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. These are typically related to (1) the disregard of inherent metric properties, such as the behaviour in the presence of class imbalance or small target structures, (2) the disregard of inherent data set properties, such as the non-independence of the test cases, and (3) the disregard of the actual biomedical domain interest that the metrics should reflect. This living dynamically document has the purpose to illustrate important limitations of performance metrics commonly applied in the field of image analysis. In this context, it focuses on biomedical image analysis problems that can be phrased as image-level classification, semantic segmentation, instance segmentation, or object detection task. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts from more than 60 institutions worldwide.
LGMar 1, 2021
Accounting for Variance in Machine Learning BenchmarksXavier Bouthillier, Pierre Delaunay, Mirko Bronzi et al.
Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameters choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice impact markedly the results. We analyze the predominant comparison methods used today in the light of this variance. We show a counter-intuitive result that adding more sources of variation to an imperfect estimator approaches better the ideal estimator at a 51 times reduction in compute cost. Building on these results, we study the error rate of detecting improvements, on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.