IV · CV · Aug 14, 2023

Robustness Stress Testing in Medical Image Classification

arXiv:2308.06889v2 · 10 citations · h-index: 61
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better validation of disease detection models in medical imaging to ensure generalizability and fairness. The contribution is incremental, since it applies existing stress-testing concepts to this domain.

The paper assesses the robustness and subgroup performance disparities of deep neural networks in medical image classification using progressive stress testing with image perturbations. It finds that some models yield more robust and equitable performance than others, and that pretraining characteristics influence downstream robustness.

Deep neural networks have shown impressive performance for image-based disease detection. Performance is commonly evaluated through clinical validation on independent test sets to demonstrate clinically acceptable accuracy. Reporting good performance metrics on test sets, however, is not always a sufficient indication of the generalizability and robustness of an algorithm. In particular, when the test data is drawn from the same distribution as the training data, the iid test set performance can be an unreliable estimate of the accuracy on new data. In this paper, we employ stress testing to assess model robustness and subgroup performance disparities in disease detection models. We design progressive stress testing using five different bidirectional and unidirectional image perturbations with six different severity levels. As a use case, we apply stress tests to measure the robustness of disease detection models for chest X-ray and skin lesion images, and demonstrate the importance of studying class and domain-specific model behaviour. Our experiments indicate that some models may yield more robust and equitable performance than others. We also find that pretraining characteristics play an important role in downstream robustness. We conclude that progressive stress testing is a viable and important tool and should become standard practice in the clinical validation of image-based disease detection models.
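The progressive stress testing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the brightness perturbation, the step size, the integer severity scale, the synthetic images, and the names `adjust_brightness`, `stress_test`, and `toy_predict` are all assumptions made for illustration. The sketch applies one bidirectional perturbation at increasing severity in both directions and records overall and per-class accuracy at each level.

```python
import numpy as np

def adjust_brightness(images, level, step=0.1):
    """Hypothetical bidirectional perturbation: shift intensities by level * step.

    Negative levels darken, positive levels brighten; images are assumed in [0, 1].
    """
    return np.clip(images + level * step, 0.0, 1.0)

def stress_test(predict, images, labels, perturb, levels):
    """Evaluate a classifier at each severity level.

    Returns {level: {"overall": accuracy, "per_class": {class: accuracy}}}.
    """
    results = {}
    for level in levels:
        preds = predict(perturb(images, level))
        correct = preds == labels
        per_class = {
            int(c): float(correct[labels == c].mean()) for c in np.unique(labels)
        }
        results[level] = {"overall": float(correct.mean()), "per_class": per_class}
    return results

# Synthetic stand-in data (an assumption, not medical images):
# class 0 images are darker (mean 0.35), class 1 images are brighter (mean 0.65).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
noise = rng.normal(0.0, 0.05, size=(200, 8, 8))
images = np.clip(0.35 + 0.3 * labels[:, None, None] + noise, 0.0, 1.0)

def toy_predict(batch):
    """Toy classifier: predict class 1 when the mean intensity exceeds 0.5."""
    return (batch.mean(axis=(1, 2)) > 0.5).astype(int)

# Severity levels on an assumed scale: 0 is the unperturbed baseline.
levels = list(range(-3, 4))
report = stress_test(toy_predict, images, labels, adjust_brightness, levels)

for level in levels:
    entry = report[level]
    print(f"severity {level:+d}: overall {entry['overall']:.2f}, "
          f"per class {entry['per_class']}")
```

In this toy setup the overall accuracy hides which class fails: brightening pushes every class 0 image over the threshold while class 1 stays correct, and darkening does the reverse. That kind of class-specific behaviour is what per-class reporting is meant to expose.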

Code Implementations: 1 repo
