Detecting Domain Shift in Multiple Instance Learning for Digital Pathology Using Fréchet Domain Distance
This work addresses unreliable MIL system performance under domain shift, a concern for care providers and vendors in digital pathology. It offers a tool for verifying reliability without additional annotations, although the contribution is incremental in that it builds on existing shift detection methods.
The study investigated the sensitivity of multiple-instance learning (MIL) to domain shifts in digital pathology and showed that clinically realistic differences in data affect performance. It proposed the Fréchet Domain Distance (FDD) metric, which achieved a mean Pearson correlation of 0.70 with changes in classification performance, outperforming baselines such as Deep ensemble (0.45) and Representation shift (0.56).
Multiple-instance learning (MIL) is an attractive approach for digital pathology applications as it reduces the costs related to data collection and labelling. However, it is not clear how sensitive MIL is to clinically realistic domain shifts, i.e., differences in data distribution that could negatively affect performance, or whether existing metrics for detecting domain shifts work well with these algorithms. We trained an attention-based MIL algorithm to classify whether a whole-slide image of a lymph node contains breast tumour metastases. The algorithm was evaluated on data from a hospital in a different country and on various subsets of this data that correspond to different levels of domain shift. Our contributions include showing that MIL for digital pathology is affected by clinically realistic differences in data, evaluating which features from a MIL model are most suitable for detecting changes in performance, and proposing an unsupervised metric named Fréchet Domain Distance (FDD) for quantification of domain shifts. Shift measure performance was evaluated as the mean Pearson correlation with the change in classification performance, where FDD achieved 0.70 across 10-fold cross-validation models. The baselines were Deep ensemble, Difference of Confidence, and Representation shift, which yielded mean Pearson correlations of 0.45, -0.29, and 0.56, respectively. FDD could be a valuable tool for care providers and vendors who need to verify whether a MIL system is likely to perform reliably when implemented at a new site, without requiring any additional annotations from pathologists.
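The abstract does not state the formula behind FDD or which features it uses. As a minimal sketch, assume it follows the standard Fréchet distance between two Gaussians fitted to feature vectors from the source and target data, as in the Fréchet Inception Distance. Under the same assumption, a shift measure can be scored by its Pearson correlation with the observed performance drop. The function names `frechet_distance` and `pearson` and the synthetic features below are hypothetical and are not taken from the paper.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (n_samples, n_features).
    Assumption: this is the generic Gaussian formula, not necessarily
    the exact definition of FDD used in the paper.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clipping guards against round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def pearson(x, y):
    """Pearson correlation between a shift measure and a performance change."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic stand-ins for features extracted from a trained model
# (hypothetical data; real features would come from the MIL network).
rng = np.random.default_rng(0)
source = rng.normal(size=(500, 4))
shifted = source + 2.0  # same covariance, mean moved by 2 in each dimension

distance_same = frechet_distance(source, source)
distance_shifted = frechet_distance(source, shifted)
```

In this toy setup the covariances are identical, so the distance reduces to the squared distance between the means. In an evaluation like the one described above, one distance would be computed per shifted subset and `pearson` would compare those distances with the corresponding changes in classification performance.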