IV CV LG · Dec 14, 2020

D-LEMA: Deep Learning Ensembles from Multiple Annotations -- Application to Skin Lesion Segmentation

Zahra Mirikharaji, Kumar Abhishek, Saeed Izadi, Ghassan Hamarneh

arXiv:2012.07206v212.138 citations

Originality: Incremental advance

AI Analysis

This work is significant for medical image analysis researchers and practitioners, as it provides a method to robustly train segmentation models using multiple, potentially conflicting, expert annotations, which is a common problem in real-world medical datasets.

This paper addresses the challenge of training deep learning models for medical image segmentation when multiple, often contradictory, annotations are available per image. The authors propose an ensemble of Bayesian fully convolutional networks (FCNs) that accounts for inter-annotator disagreements during training and improves confidence calibration through prediction fusion, demonstrating superior performance on the ISIC Archive and good generalization across PH2 and DermoFit datasets.

Medical image segmentation annotations suffer from inter- and intra-observer variations even among experts due to intrinsic differences in human annotators and ambiguous boundaries. Leveraging a collection of annotators' opinions for an image is an interesting way of estimating a gold standard. Although training deep models in a supervised setting with a single annotation per image has been extensively studied, generalizing their training to work with datasets containing multiple annotations per image remains a fairly unexplored problem. In this paper, we propose an approach to handle annotators' disagreements when training a deep model. To this end, we propose an ensemble of Bayesian fully convolutional networks (FCNs) for the segmentation task by considering two major factors in the aggregation of multiple ground truth annotations: (1) handling contradictory annotations in the training data originating from inter-annotator disagreements and (2) improving confidence calibration through the fusion of base models' predictions. We demonstrate the superior performance of our approach on the ISIC Archive and explore the generalization performance of our proposed method by cross-dataset evaluation on the PH2 and DermoFit datasets.
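The abstract names two ingredients, aggregating several annotators' masks and fusing the base models' predictions, but does not give the rules used for either. As a rough illustration only, the sketch below assumes the simplest options: a per-pixel mean over the base models' foreground probabilities, and a per-pixel fraction of annotators who marked the pixel as lesion. The function names `fuse_mean`, `annotator_agreement`, and `threshold` are hypothetical and are not taken from the paper.

```python
from typing import List

# A 2D grid of per-pixel foreground probabilities in [0, 1].
ProbMap = List[List[float]]


def fuse_mean(prob_maps: List[ProbMap]) -> ProbMap:
    """Average per-pixel probabilities across maps (assumed fusion rule)."""
    if not prob_maps:
        raise ValueError("need at least one probability map")
    n = len(prob_maps)
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])
    return [
        [sum(m[r][c] for m in prob_maps) / n for c in range(cols)]
        for r in range(rows)
    ]


def annotator_agreement(masks: List[List[List[int]]]) -> ProbMap:
    """Fraction of annotators marking each pixel as lesion (a soft label)."""
    as_float = [[[float(v) for v in row] for row in m] for m in masks]
    return fuse_mean(as_float)


def threshold(prob_map: ProbMap, t: float = 0.5) -> List[List[int]]:
    """Binarize a probability map; pixels with p >= t become foreground."""
    return [[1 if p >= t else 0 for p in row] for row in prob_map]


if __name__ == "__main__":
    # Three hypothetical annotators disagree on a 1x2 image.
    soft = annotator_agreement([[[1, 0]], [[1, 1]], [[0, 0]]])
    print(soft)

    # Two hypothetical base models predict the same 1x2 image.
    fused = fuse_mean([[[0.0, 1.0]], [[1.0, 1.0]]])
    print(fused, threshold(fused))
```

Averaging is only one possible fusion choice; a weighted mean or a vote over thresholded masks would fit the same interface.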
