On the calibration of neural networks for histological slide-level classification
This work addresses the need for reliable uncertainty communication in medical AI, though its contribution is incremental: it compares existing architectures on a single, specific task.
The study evaluated neural network calibration for slide-level classification in digital pathology, specifically the prediction of Microsatellite Instability from colorectal cancer tissue. It found that Transformers achieved good classification performance and calibration, but reliability diagrams showed that they tend to push output probabilities toward extreme values, resulting in overconfident predictions.
Deep Neural Networks have shown promising classification performance when predicting certain biomarkers from Whole Slide Images in digital pathology. However, the calibration of the networks' output probabilities is often not evaluated. Communicating uncertainty by providing reliable confidence scores is highly relevant in the medical context. In this work, we compare three neural network architectures that aggregate patch-level feature representations into a slide-level prediction, and we evaluate both their classification performance and their calibration. As the slide-level classification task, we choose the prediction of Microsatellite Instability from Colorectal Cancer tissue sections. We observe that Transformers lead to good results in terms of classification performance and calibration. When evaluating the classification performance on a separate dataset, we observe that Transformers generalize best. The investigation of reliability diagrams provides insights beyond the Expected Calibration Error metric: we observe that Transformers in particular push the output probabilities to extreme values, which results in overconfident predictions.
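To make the two calibration tools named above concrete, the following is a minimal sketch of how an Expected Calibration Error and the per-bin values behind a reliability diagram could be computed for a binary slide-level classifier. It is an illustration under stated assumptions, not the procedure used in this work: the function name `expected_calibration_error`, the use of ten equal-width bins, and the definition of confidence as the probability of the predicted class are all choices made here for the example.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width confidence bins (illustrative sketch).

    probs  : list of predicted probabilities for the positive class
    labels : list of ground-truth labels (0 or 1)
    Returns (ece, diagram), where diagram holds one tuple per non-empty bin:
    (bin_lower, bin_upper, mean_confidence, accuracy, count).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        # Assumption: confidence is the probability of the predicted class.
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    n = len(probs)
    ece = 0.0
    diagram = []
    for idx, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(a for _, a in items) / len(items)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(items) / n) * abs(accuracy - mean_conf)
        diagram.append((idx / n_bins, (idx + 1) / n_bins,
                        mean_conf, accuracy, len(items)))
    return ece, diagram


if __name__ == "__main__":
    # Hypothetical slide-level outputs that cluster near 0 and 1.
    probs = [0.99, 0.98, 0.97, 0.02, 0.01, 0.99]
    labels = [1, 1, 0, 0, 0, 0]
    ece, diagram = expected_calibration_error(probs, labels)
    print(f"ECE = {ece:.3f}")
    for lower, upper, conf, acc, count in diagram:
        print(f"[{lower:.1f}, {upper:.1f}] conf={conf:.2f} acc={acc:.2f} n={count}")
```

In a reliability diagram, bins whose accuracy falls below their mean confidence indicate overconfidence. Because ECE weights each bin by its sample count, a model that pushes nearly all outputs into the extreme bins is summarized almost entirely by that bin, which is one reason the per-bin view can reveal behavior that the single scalar hides.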