CV · Apr 1

When AI and Experts Agree on Error: Intrinsic Ambiguity in Dermatoscopic Images

Loris Cino, Pier Luigi Mazzeo, Alessandro Martella, Giulia Radi, Renato Rossi, Cosimo Distante

arXiv:2604.00651 · 17.3 · h-index: 35

AI Analysis

This work addresses unreliable diagnosis in dermatology, a concern for both clinicians and AI developers, and shows that some errors stem from inherent image complexity rather than algorithmic flaws.

The study examines intrinsic ambiguity in dermatoscopic images by showing that a subset of images misclassified by every tested CNN is also hard for human experts. On that subset, expert agreement with ground-truth labels dropped from a Cohen's kappa of 0.61 (control images) to 0.08, and inter-rater reliability among the experts (Fleiss' kappa) fell from 0.456 to 0.275.
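For readers less familiar with the two agreement statistics quoted here, the following is a minimal sketch of how Cohen's kappa (one rater against a reference labeling) and Fleiss' kappa (agreement among several raters) are conventionally computed. The function names and the toy labels are hypothetical and chosen for illustration; this is not the authors' released code, and the numbers are not data from the study.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the labels all raters gave item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0]) if n_items else 0
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement: share of rater pairs that agree on this item.
        pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += pairs / (n_raters * (n_raters - 1))
    p_bar /= n_items
    # Chance agreement from the pooled label proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return 1.0  # degenerate case: only one label was ever used
    return (p_bar - p_e) / (1 - p_e)


# Toy labels, invented for illustration (not data from the study).
truth = ["mel", "mel", "nev", "nev"]
expert = ["mel", "nev", "nev", "nev"]
print(cohen_kappa(truth, expert))
print(fleiss_kappa([["mel", "mel", "nev"], ["mel", "nev", "nev"]]))
```

Both statistics rescale observed agreement by the agreement expected from label frequencies alone, so a value near 0 (such as the reported 0.08) means the raters did little better than chance, while 1 means perfect agreement.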

The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models, a phenomenon statistically proven to exceed random chance. To determine if these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted, with Cohen's kappa dropping to a mere 0.08 for the difficult images, compared to 0.61 for the control group. Second, we observed a severe deterioration in expert consensus; inter-rater reliability among physicians fell from moderate concordance (Fleiss' kappa = 0.456) on control images to only modest agreement (Fleiss' kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available.
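The abstract does not say which statistical test was used to show that the shared misclassifications exceed random chance. As one illustrative possibility, and only under the assumption that the models err independently of one another, the number of images missed by every model would follow a binomial distribution whose success probability is the product of the individual error rates. The sketch below computes the expected count and a one-sided tail probability under that null; the function name, parameters, and example figures are hypothetical.

```python
import math


def chance_overlap_pvalue(n_images, error_rates, n_shared):
    """Expected count and P(X >= n_shared) for images missed by all models,
    assuming (hypothetically) that the models' errors are independent."""
    # Probability that a single image is missed by every model at once.
    p_all = math.prod(error_rates)
    if not 0.0 < p_all < 1.0:
        raise ValueError("error rates must lie strictly between 0 and 1")
    if not 0 <= n_shared <= n_images:
        raise ValueError("n_shared must be between 0 and n_images")
    tail = 0.0
    for i in range(n_shared, n_images + 1):
        # Binomial pmf evaluated in log space to avoid overflow for large n.
        log_pmf = (
            math.lgamma(n_images + 1)
            - math.lgamma(i + 1)
            - math.lgamma(n_images - i + 1)
            + i * math.log(p_all)
            + (n_images - i) * math.log1p(-p_all)
        )
        tail += math.exp(log_pmf)
    return n_images * p_all, min(tail, 1.0)


# Made-up numbers for illustration only (not taken from the paper).
expected, p_value = chance_overlap_pvalue(1000, [0.15, 0.12, 0.18], 40)
print(expected, p_value)
```

Independence is a crude null, since models trained on the same data tend to share errors for reasons unrelated to image ambiguity; a small tail probability here only indicates that the overlap is larger than independent errors would produce.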

