LG · Apr 10

Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

arXiv:2604.08941 · 52.8 · h-index: 16

Predicted impact: top 47% in LG · last 90 days
Originality: Incremental advance

AI Analysis

The work addresses reliability issues that affect safe deployment of medical AI systems, though it is incremental because it builds on existing uncertainty quantification methods.

The paper shows that predictive entropy from a single forward pass can identify both miscalibrated confidence and sensitivity to question rephrasing in medical vision-language models, achieving AUROC scores of 0.711 on MedGemma and 0.878 on LLaVA-RAD for predicting paraphrase sensitivity.

Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-RAD-7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD, p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-RAD's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (ECE = 4.3%) and selective prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs. 0.657) and paraphrase screening. Simple methods win.
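As a rough illustration of the single-threshold idea in the abstract, the sketch below computes predictive entropy from one set of answer probabilities and flags high-entropy samples. It is a minimal sketch, not the paper's implementation: the function names, the example probabilities, and the 0.5-nat cutoff are all assumptions made for illustration, and a real threshold would be tuned on validation data.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities."""
    # Clip to avoid log(0) when a class has zero probability.
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def flag_uncertain(probs, threshold):
    """Boolean mask: True where entropy exceeds the (assumed) threshold."""
    return predictive_entropy(probs) > threshold

# Hypothetical yes/no answer probabilities from one forward pass each.
probs = np.array([
    [0.98, 0.02],  # confident prediction, far from the decision boundary
    [0.55, 0.45],  # near the decision boundary
    [0.10, 0.90],  # fairly confident prediction
])

entropy = predictive_entropy(probs)
# 0.5 nats is an arbitrary illustrative cutoff, not a value from the paper.
flags = flag_uncertain(probs, threshold=0.5)
```

Under these assumptions, only the second sample, which sits close to a 50/50 split, is flagged. The same flag would then serve as a candidate screen for both unreliable and rephrase-sensitive predictions.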
