Assessing Reliability and Challenges of Uncertainty Estimations for Medical Image Segmentation
This work addresses unreliable failure detection in deep learning systems for medical image segmentation, a problem that is critical for clinical integration. The contribution is incremental, since it benchmarks existing methods.
The paper evaluated common uncertainty estimation methods for medical image segmentation and found that they perform similarly and are well calibrated at the dataset level but tend to be miscalibrated at the subject level, which compromises their reliability.
Despite recent improvements in overall accuracy, deep learning systems still exhibit low levels of robustness. Detecting possible failures is critical for the successful clinical integration of these systems, where each data point corresponds to an individual patient. Uncertainty measures are a promising direction for improving failure detection, since they provide a measure of a system's confidence. Although many uncertainty estimation methods have been proposed for deep learning, little is known about their benefits and current challenges for medical image segmentation. Therefore, we report the results of evaluating common voxel-wise uncertainty measures with respect to their reliability and limitations on two medical image segmentation datasets. Results show that current uncertainty methods perform similarly and that, although they are well calibrated at the dataset level, they tend to be miscalibrated at the subject level. As a result, the reliability of uncertainty estimates is compromised, highlighting the importance of developing subject-wise uncertainty estimations. Additionally, among the benchmarked methods, we found auxiliary networks to be a valid alternative to common uncertainty methods, since they can be applied to any previously trained segmentation model.
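The gap between dataset-level and subject-level calibration can be illustrated with a small sketch. It assumes expected calibration error (ECE) as the calibration measure, which the abstract does not specify, and it uses synthetic voxel confidences instead of real segmentation outputs. The function name `expected_calibration_error` and the two toy subjects are hypothetical and serve illustration only.

```python
import numpy as np


def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: bin-weighted mean of |accuracy - confidence| over confidence bins.

    Assumption: ECE stands in for whatever calibration metric the paper used.
    """
    confidence = np.asarray(confidence, dtype=float).ravel()
    correct = np.asarray(correct, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of voxels in the bin
    return ece


# Two synthetic subjects (hypothetical data), 1000 voxels each.
# Both report confidence 0.8 on every voxel.
n_voxels = 1000
conf_a = np.full(n_voxels, 0.8)
conf_b = np.full(n_voxels, 0.8)
# Subject A is over-confident: only 60% of its voxels are correct.
correct_a = np.concatenate([np.ones(600), np.zeros(400)])
# Subject B is under-confident: all of its voxels are correct.
correct_b = np.ones(n_voxels)

# Subject-level calibration: one ECE per subject.
subject_eces = [
    expected_calibration_error(conf_a, correct_a),
    expected_calibration_error(conf_b, correct_b),
]

# Dataset-level calibration: pool all voxels before computing ECE.
dataset_ece = expected_calibration_error(
    np.concatenate([conf_a, conf_b]),
    np.concatenate([correct_a, correct_b]),
)

print("subject-level ECE:", subject_eces)
print("dataset-level ECE:", dataset_ece)
```

In this toy setting the over-confident and under-confident subjects cancel out when pooled, so the dataset-level error is close to zero while each subject is individually miscalibrated. This is one way the pattern described in the abstract could arise, not necessarily the mechanism observed in the paper.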