CVJun 3, 2022
Metrics reloaded: Recommendations for image analysis validationLena Maier-Hein, Annika Reinke, Patrick Godau et al. · utoronto
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
CVFeb 3, 2023
Understanding metric-related pitfalls in image analysis validationAnnika Reinke, Minu D. Tizabi, Michael Baumgartner et al.
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
30.9SEMar 16
Code Sharing In Prediction Model Research: A Scoping ReviewThomas Sounack, Raffaele Giancotti, Catherine A. Gao et al.
Analytical code is essential for reproducing diagnostic and prognostic prediction model research, yet code availability in the published literature remains limited. While the TRIPOD statements set standards for reporting prediction model methods, they do not define explicit standards for repository structure and documentation. This review quantifies current code-sharing practices to inform the development of TRIPOD-Code, a TRIPOD extension reporting guideline focused on code sharing. We conducted a scoping review of PubMed-indexed articles citing TRIPOD or TRIPOD+AI as of Aug 11, 2025, restricted to studies retrievable via the PubMed Central Open Access API. Eligible studies developed, updated, or validated multivariable prediction models. A large language model-assisted pipeline was developed to screen articles and extract code availability statements and repository links. Repositories were assessed with the same LLM against 14 predefined reproducibility-related features. Our code is made publicly available. Among 3,967 eligible articles, 12.2% included code sharing statements. Code sharing increased over time, reaching 15.8% in 2025, and was higher among TRIPOD+AI-citing studies than TRIPOD-citing studies. Sharing prevalence varied widely by journal and country. Repository assessment showed substantial heterogeneity in reproducibility features: most repositories contained a README file (80.5%), but fewer specified dependencies (37.6%; version-constrained 21.6%) or were modular (42.4%). In prediction model research, code sharing remains relatively uncommon, and when shared, often falls short of being reusable. These findings provide an empirical baseline for the TRIPOD-Code extension and underscore the need for clearer expectations beyond code availability, including documentation, dependency specification, licensing, and executable structure.
LGDec 13, 2024
Performance evaluation of predictive AI models to support medical decisions: Overview and guidanceBen Van Calster, Gary S. Collins, Andrew J. Vickers et al.
A myriad of measures to illustrate performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance, the fifth domain covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure's expected value is optimized when it is calculated using the correct probabilities (i.e., a "proper" measure), and (2) whether they reflect either purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibit both characteristics, fourteen measures exhibited one characteristic, and one measure possessed neither characteristic (the F1 measure). All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.
IVApr 12, 2021
Common Limitations of Image Processing Metrics: A Picture StoryAnnika Reinke, Minu D. Tizabi, Carole H. Sudre et al.
While the importance of automatic image analysis is continuously increasing, recent meta-research revealed major flaws with respect to algorithm validation. Performance metrics are particularly key for meaningful, objective, and transparent performance assessment and validation of the used automatic algorithms, but relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. These are typically related to (1) the disregard of inherent metric properties, such as the behaviour in the presence of class imbalance or small target structures, (2) the disregard of inherent data set properties, such as the non-independence of the test cases, and (3) the disregard of the actual biomedical domain interest that the metrics should reflect. This living dynamically document has the purpose to illustrate important limitations of performance metrics commonly applied in the field of image analysis. In this context, it focuses on biomedical image analysis problems that can be phrased as image-level classification, semantic segmentation, instance segmentation, or object detection task. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts from more than 60 institutions worldwide.
HCMar 2, 2020
Decision Support in the Context of a Complex Decision SituationTeus H. Kappen, Mirko Noordegraaf, Wilton A. van Klei et al.
The aim of a clinical decision support tool is to reduce the complexity of clinical decisions. However, when decision support tools are poorly implemented they may actually confuse physicians and complicate clinical care. This paper argues that information from decision support tools is often removed from the clinical context of the targeted decisions. Physicians largely depend on clinical context to handle the complexity of their day-to-day decisions. Clinical context enables them to take into account all ambiguous information and patient preferences. Decision support tools that provide analytic information to physicians, without its context, may then complicate the decision process of physicians. It is likely that the joint forces of physicians and technology will produce better decisions than either of them exclusively: after all, they do have different ways of dealing with the complexity of a decision and are thus complementary. Therefore, the future challenges of decision support do not only reside in the optimization of the predictive value of the underlying models and algorithms, but equally in the effective communication of information and its context to doctors.