CL | Mar 7, 2025

Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework

arXiv:2503.05505v2 | 1 citation | h-index: 3 | Mathematics
Originality: Incremental advance
AI Analysis

This work addresses trustworthiness issues in high-stakes medical QA tasks, though it is incremental as it builds on existing conformal prediction methods.

The paper tackles hallucinations and nonfactual information in large language models (LLMs) for medical multiple-choice question answering by proposing an enhanced conformal prediction framework. Evaluated on the MedMCQA, MedQA, and MMLU datasets with four off-the-shelf LLMs, the method meets the specified error-rate guarantees, and its average prediction set size shrinks as the risk level increases.

Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) provides a statistically rigorous framework for marginal (average) coverage guarantees but has seen limited exploration in medical QA. This paper proposes an enhanced CP framework for medical multiple-choice question-answering (MCQA) tasks. The framework ties the nonconformity score to the frequency score of the correct option, which is estimated through self-consistency sampling, so it needs no access to model internals. It also incorporates a risk control strategy with a monotonic loss function. Evaluated on the MedMCQA, MedQA, and MMLU datasets with four off-the-shelf LLMs, the proposed method meets the specified error-rate guarantees, and its average prediction set size shrinks as the risk level increases, offering a promising uncertainty evaluation metric for LLMs.
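The frequency-based scoring described above can be illustrated with a minimal sketch of standard split conformal prediction. This is not the authors' implementation, and it leaves out the paper's risk control strategy and monotonic loss function. The function names, the four-option layout, and the use of 1 minus the sampled frequency as the nonconformity score are assumptions made for illustration; the sampled answers stand in for repeated LLM queries on the same question.

```python
import math
from collections import Counter

OPTIONS = ("A", "B", "C", "D")  # assumed four-option MCQA layout


def frequency_scores(samples, options=OPTIONS):
    """Fraction of sampled answers that picked each option (self-consistency)."""
    counts = Counter(samples)
    return {opt: counts[opt] / len(samples) for opt in options}


def calibrate_threshold(calib_samples, calib_labels, alpha):
    """Split-conformal threshold from the scores 1 - freq(correct option).

    Hypothetical helper: returns the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score, the usual finite-sample conformal quantile.
    """
    scores = sorted(
        1.0 - frequency_scores(s)[y] for s, y in zip(calib_samples, calib_labels)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every option.
        return 1.0
    return scores[rank - 1]


def prediction_set(samples, threshold, options=OPTIONS):
    """Options whose nonconformity score does not exceed the threshold."""
    freqs = frequency_scores(samples, options)
    return [opt for opt in options if 1.0 - freqs[opt] <= threshold]
```

In this sketch, a larger risk level `alpha` selects a lower quantile of the calibration scores, so the threshold can only stay the same or decrease and the prediction sets can only stay the same or shrink. That is the same direction as the trend reported in the abstract, though the paper's own procedure may differ in its details.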

