CV · Jul 14, 2023

Challenge Results Are Not Reproducible

Annika Reinke, Georg Grab, Lena Maier-Hein

arXiv:2307.07226v12.82 citations · h-index: 53

Originality: Synthesis-oriented

AI Analysis

The study investigated the reproducibility of methods in medical image analysis challenges by reimplementing algorithms submitted to the 2019 ROBUST-MIS challenge. The leaderboard rankings changed substantially, indicating poor reproducibility.

The finding highlights a critical issue for researchers and practitioners in medical image analysis: challenge results may not be reliable for benchmarking. The contribution is incremental, since it builds on prior analyses of challenges.

While clinical trials are the state-of-the-art methods to assess the effect of new medication in a comparative manner, benchmarking in the field of medical image analysis is performed by so-called challenges. Recently, comprehensive analysis of multiple biomedical image analysis challenges revealed large discrepancies between the impact of challenges and quality control of the design and reporting standard. This work aims to follow up on these results and attempts to address the specific question of the reproducibility of the participants' methods. In an effort to determine whether alternative interpretations of the method description may change the challenge ranking, we reproduced the algorithms submitted to the 2019 Robust Medical Image Segmentation Challenge (ROBUST-MIS). The leaderboard differed substantially between the original challenge and reimplementation, indicating that challenge rankings may not be sufficiently reproducible.
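
One way to quantify how much a leaderboard changes between an original challenge and a reimplementation is a rank correlation such as Kendall's tau. The sketch below is illustrative only: the abstract does not state which comparison measure the authors used, and the team names and ranks are hypothetical placeholders, not results from ROBUST-MIS. It assumes both rankings cover the same teams and contain no ties.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items.

    Both arguments map an item name to its rank (1 = best).
    Assumes no tied ranks within either ranking.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    items = list(rank_a)
    if len(items) < 2:
        raise ValueError("need at least two items to compare rankings")

    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: both rankings order the pair the same way.
        product = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks, for illustration only (not data from the paper).
original = {"team_a": 1, "team_b": 2, "team_c": 3, "team_d": 4}
reimplemented = {"team_a": 3, "team_b": 1, "team_c": 2, "team_d": 4}

tau = kendall_tau(original, reimplemented)
print(f"Kendall's tau: {tau:.3f}")
```

A value of 1 means the two leaderboards agree on every pair of teams, and -1 means they are fully reversed. Values well below 1 would be consistent with a leaderboard that "differed substantially" in the sense described above.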
