Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring
This work addresses fairness and safety concerns in medical AI by flagging subtle within-distribution errors for clinician review. The contribution is incremental, since it builds on existing consistency-based methods.
The paper tackles hidden failures in chest radiograph (CXR) models by proposing an augmentation-sensitivity risk scoring (ASRS) framework. The cases it flags as most sensitive to augmentation show substantially lower recall (a drop of 0.2 to 0.3) despite high AUROC and confidence.
Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches -- based on confidence calibration or out-of-distribution (OOD) detection -- struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
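The scoring and stratification steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the embeddings of the original image and of its rotated views ($\pm 15^\circ$, $\pm 30^\circ$) have already been computed by an encoder such as RAD-DINO. The abstract does not state the distance metric or how shifts are aggregated across rotations, so the cosine distance and the mean used here are assumptions, and the function names `cosine_distance`, `sensitivity_score`, and `stability_quartiles` are hypothetical.

```python
import math
from statistics import quantiles


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors.

    Assumption: the abstract only says "embedding shifts"; cosine
    distance is one plausible choice, not necessarily the paper's.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sensitivity_score(original_emb, rotated_embs):
    """Average the embedding shift over all rotated views of one image.

    Assumption: the mean is used as the aggregate; the source does not
    specify the aggregation.
    """
    shifts = [cosine_distance(original_emb, r) for r in rotated_embs]
    return sum(shifts) / len(shifts)


def stability_quartiles(scores):
    """Map each score to a quartile in 1..4, where 4 is most sensitive."""
    q1, q2, q3 = quantiles(scores, n=4)  # three cut points
    # Each comparison is a bool; summing counts the cut points exceeded.
    return [1 + (s > q1) + (s > q2) + (s > q3) for s in scores]
```

Because the score uses only embeddings and no ground-truth labels, it can be computed at inference time. Cases landing in the top quartile would be the natural candidates for deferral to clinician review.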