Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
This work addresses trustworthiness in high-stakes medical QA tasks. Its contribution is incremental, since it builds on existing conformal prediction methods.
The paper tackles hallucinations and nonfactual information in large language models (LLMs) for medical multiple-choice question answering by proposing an enhanced conformal prediction framework. Evaluated on the MedMCQA, MedQA, and MMLU datasets with four off-the-shelf LLMs, the method meets the specified error-rate guarantees, and its average prediction set size decreases as the risk level increases.
Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) provides a statistically rigorous framework for marginal (average) coverage guarantees but has seen limited exploration in medical QA. This paper proposes an enhanced CP framework for medical multiple-choice question-answering (MCQA) tasks. By associating the nonconformity score with the frequency score of correct options and leveraging self-consistency, the framework addresses internal model opacity and incorporates a risk control strategy with a monotonic loss function. Evaluated on the MedMCQA, MedQA, and MMLU datasets using four off-the-shelf LLMs, the proposed method meets the specified error-rate guarantees, and its average prediction set size decreases as the risk level increases, offering a promising uncertainty evaluation metric for LLMs.
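To make the idea concrete, the following is a minimal sketch of split conformal prediction with a frequency-based score. It is an illustration under stated assumptions, not the paper's implementation. The score here is assumed to be one minus the fraction of sampled answers that chose an option, the option labels and all function names are hypothetical, and the risk control step with a monotonic loss function is not covered.

```python
import math
from collections import Counter

OPTIONS = ["A", "B", "C", "D"]  # hypothetical option labels


def frequency_scores(samples, options=OPTIONS):
    """Fraction of sampled answers that picked each option (self-consistency)."""
    counts = Counter(samples)
    return {opt: counts.get(opt, 0) / len(samples) for opt in options}


def calibrate_threshold(cal_samples, cal_labels, alpha, options=OPTIONS):
    """Split-conformal threshold from scores 1 - frequency(correct option).

    Assumption: this score definition is illustrative; the paper's exact
    formula is not given in this section.
    """
    scores = sorted(
        1.0 - frequency_scores(samples, options)[label]
        for samples, label in zip(cal_samples, cal_labels)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: keep every option
    return scores[k - 1]


def prediction_set(samples, threshold, options=OPTIONS):
    """Options whose score (1 - sampled frequency) is within the threshold."""
    freqs = frequency_scores(samples, options)
    return [opt for opt in options if 1.0 - freqs[opt] <= threshold]
```

Here `samples` is the list of option labels obtained by querying the model several times on one question, so no logits or other model internals are needed. A larger `alpha` (risk level) gives a smaller or equal threshold and therefore smaller prediction sets, at the cost of a weaker coverage guarantee of at least `1 - alpha` on average.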