IV | AI | CV | Jun 29, 2025

Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, Tal Arbel

arXiv:2506.23298v315.25 citations | h-index: 38 | Has Code | MICCAI

Originality: Incremental advance

AI Analysis

The paper addresses safety and fairness issues in deploying MLLMs in clinical practice, with a focus on demographic subgroups. It is rated incremental because it builds on existing calibration and fairness methods.

The paper tackles calibration biases and demographic unfairness in multimodal large language models (MLLMs) used for few-shot in-context learning in medical image classification. It introduces CALIN, an inference-time calibration method that, across three medical imaging datasets, improves overall prediction accuracy and yields fair confidence calibration with a minimal fairness-utility trade-off.

Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in medical image analysis. However, safe deployment of these models in real-world clinical practice requires an in-depth analysis of the accuracy of their predictions and of the associated calibration errors, particularly across demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate these biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure that progresses from the population level to the subgroup level prior to inference. It then applies this estimate to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets (PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification) demonstrate CALIN's effectiveness at ensuring fair confidence calibration in its predictions, while improving overall prediction accuracy and exhibiting a minimal fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.
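
To make the terms in the abstract concrete, the following is a minimal Python sketch of two ideas it mentions: measuring how calibration error differs between demographic subgroups, and adjusting a confidence vector with a calibration matrix. It is not CALIN and is not taken from the authors' codebase. The function names, the equal-width binning used for expected calibration error (ECE), the max-minus-min gap as a fairness measure, and the "matrix-vector product then renormalize" form of applying a calibration matrix are all assumptions made for illustration; the abstract does not specify how the matrices are estimated or applied.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: size-weighted mean of |accuracy - mean confidence|."""
    bins = defaultdict(list)
    for conf, hit in zip(confidences, correct):
        # Clamp the index so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(h for _, h in items) / len(items)
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece


def subgroup_ece_gap(confidences, correct, groups, n_bins=10):
    """Return per-subgroup ECE and the gap between the worst and best subgroup.

    The max-minus-min gap is one simple (assumed) way to summarize unfairness.
    """
    by_group = defaultdict(lambda: ([], []))
    for conf, hit, group in zip(confidences, correct, groups):
        by_group[group][0].append(conf)
        by_group[group][1].append(hit)
    per_group = {
        g: expected_calibration_error(c, h, n_bins) for g, (c, h) in by_group.items()
    }
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap


def apply_calibration_matrix(probs, matrix):
    """Assumed form of matrix calibration: scores = matrix @ probs, then renormalize."""
    scores = [sum(w * p for w, p in zip(row, probs)) for row in matrix]
    total = sum(scores)
    if total <= 0:
        raise ValueError("calibrated scores must have a positive sum")
    return [s / total for s in scores]


# Toy data (made up): both subgroups are equally confident, but B is less accurate.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 1, 0]
groups = ["A", "A", "B", "B"]
per_group, gap = subgroup_ece_gap(confidences, correct, groups)

# A hypothetical diagonal matrix that down-weights class 0 and up-weights class 1.
calibrated = apply_calibration_matrix([0.8, 0.2], [[0.5, 0.0], [0.0, 2.0]])
```

In this toy setting, a single population-level matrix would shift both subgroups identically, which cannot close a gap caused by subgroup B being less accurate at the same confidence. That is one way to read the motivation for a second, subgroup-level step, although the actual bi-level estimation procedure is described only in the paper itself.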
