On undesired emergent behaviors in compound prostate cancer detection systems
This paper addresses over-optimistic evaluations of medical AI systems for prostate cancer diagnosis and highlights performance degradation in real-world use. The contribution is incremental, since it concerns evaluation methodology rather than new detection methods.
The paper evaluates compound prostate cancer detection systems by simulating realistic deployment scenarios. Even with the highest-performing prostate segmentation module (DSC: 90.07±0.74), detection performance in the realistic setting dropped significantly compared to an idealized setting (AUC: 77.93±3.06 vs. 84.30±4.07, P<.001).
Artificial intelligence systems show promise to aid in the diagnostic pathway of prostate cancer (PC) by supporting radiologists in interpreting magnetic resonance images (MRI) of the prostate. Most MRI-based systems are designed to detect clinically significant PC lesions, with the main objective of preventing over-diagnosis. Typically, these systems involve an automatic prostate segmentation component and a clinically significant PC lesion detection component. Despite the compound nature of these systems, evaluations are presented assuming a standalone clinically significant PC detection component. That is, the systems are evaluated in an idealized scenario, under the assumption that a highly accurate prostate segmentation is available at test time. In this work, we aim to evaluate a clinically significant PC lesion detection system accounting for its compound nature. For that purpose, we simulate a realistic deployment scenario and evaluate the effect of two non-ideal, previously validated prostate segmentation modules on the PC detection ability of the compound system. We then compare them with an idealized setting, where prostate segmentations are assumed to have no faults. We observe significant differences in the detection ability of the compound system between the realistic scenario, even in the presence of the highest-performing prostate segmentation module (DSC: 90.07 ± 0.74), and the idealized one (AUC: 77.93 ± 3.06 and 84.30 ± 4.07, respectively; P<.001). Our results highlight the relevance of holistic evaluations for PC detection compound systems, where interactions between system components can lead to decreased performance and degradation at deployment time.
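The two metrics named in the abstract, and the idea of scoring the same cases under an idealized and a realistic segmentation, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`dice_coefficient`, `auc`, `compare_settings`) are hypothetical, the toy masks and scores are invented for illustration and are not the paper's data, and the pairwise-ranking AUC is an assumed implementation, since the source does not state how AUC was computed. Note also that the sketch returns DSC as a fraction in [0, 1], whereas the abstract appears to report it as a percentage.

```python
import numpy as np


def dice_coefficient(pred_mask, ref_mask):
    """Dice similarity coefficient (DSC) between two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention,
        # assumed here; other conventions exist).
        return 1.0
    return float(2.0 * np.logical_and(pred, ref).sum() / denom)


def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative case")
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


def compare_settings(labels, scores_ideal, scores_realistic):
    """Score the same cases under both settings and report the AUC drop."""
    ideal = auc(labels, scores_ideal)
    realistic = auc(labels, scores_realistic)
    return {"ideal": ideal, "realistic": realistic, "drop": ideal - realistic}


# Toy data (invented): 1 = clinically significant lesion present.
labels = [1, 1, 0, 0]
# Detector scores when a fault-free segmentation is assumed.
scores_ideal = [0.9, 0.8, 0.3, 0.1]
# Detector scores when an automatic, imperfect segmentation feeds the detector.
scores_realistic = [0.9, 0.2, 0.3, 0.1]

print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
print(compare_settings(labels, scores_ideal, scores_realistic))
```

The point of the sketch is that a segmentation module can score well on DSC in isolation while the downstream detector, fed that segmentation, still loses discriminative ability. Only a comparison on the compound system's final output exposes that gap.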