IV CV LG Sep 4, 2024

Evaluating Machine Learning-based Skin Cancer Diagnosis

arXiv:2409.03794v1 3.63 citations h-index: 1

Originality: Synthesis-oriented

AI Analysis

The paper addresses reliability issues in AI-based medical diagnosis for diverse populations, though its contribution is incremental: it applies existing fairness methods rather than proposing new ones.

This study evaluated two deep learning models for skin cancer detection on the HAM10000 dataset. Both models generally highlighted relevant features for most lesion types but showed significant fairness disparities across skin tones. A postprocessing strategy reduced the differences in false negative rates.

This study evaluates the reliability of two deep learning models for skin cancer detection, focusing on their explainability and fairness. Using the HAM10000 dataset of dermatoscopic images, the research assesses two convolutional neural network architectures: a MobileNet-based model and a custom CNN model. Both models are evaluated for their ability to classify skin lesions into seven categories and to distinguish between dangerous and benign lesions. Explainability is assessed using Saliency Maps and Integrated Gradients, with results interpreted by a dermatologist. The study finds that both models generally highlight relevant features for most lesion types, although they struggle with certain classes like seborrheic keratoses and vascular lesions. Fairness is evaluated using the Equalized Odds metric across sex and skin tone groups. While both models demonstrate fairness across sex groups, they show significant disparities in false positive and false negative rates between light and dark skin tones. A Calibrated Equalized Odds postprocessing strategy is applied to mitigate these disparities, resulting in improved fairness, particularly in reducing false negative rate differences. The study concludes that while the models show promise in explainability, further development is needed to ensure fairness across different skin tones. These findings underscore the importance of rigorous evaluation of AI models in medical applications, particularly in diverse population groups.
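To make the fairness criterion concrete, the sketch below computes the two gaps that Equalized Odds compares: the difference in false positive rate and in false negative rate between two groups. It is an illustration only, not the paper's code. The function names (`error_rates`, `equalized_odds_gaps`) and the toy labels are hypothetical, and the Calibrated Equalized Odds postprocessing step is not implemented here.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for 0/1 labels.

    Assumes the sample contains at least one positive and one negative;
    otherwise the corresponding rate is undefined (division by zero).
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return fp / negatives, fn / positives


def equalized_odds_gaps(y_true, y_pred, group):
    """Absolute FPR and FNR differences between exactly two groups."""
    labels = sorted(set(group))
    if len(labels) != 2:
        raise ValueError("this sketch compares exactly two groups")
    rates = []
    for g in labels:
        # Select the examples that belong to group g.
        idx = [i for i, member in enumerate(group) if member == g]
        rates.append(error_rates([y_true[i] for i in idx],
                                 [y_pred[i] for i in idx]))
    (fpr_a, fnr_a), (fpr_b, fnr_b) = rates
    return abs(fpr_a - fpr_b), abs(fnr_a - fnr_b)


# Toy data invented for illustration (1 = dangerous lesion, 0 = benign).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group = ["light"] * 4 + ["dark"] * 4

fpr_gap, fnr_gap = equalized_odds_gaps(y_true, y_pred, group)
print(f"FPR gap: {fpr_gap:.2f}, FNR gap: {fnr_gap:.2f}")
```

A classifier satisfies Equalized Odds exactly when both gaps are zero. In a diagnostic setting the false negative gap is usually the more serious of the two, since it means dangerous lesions are missed more often in one group.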
