CY AI CL LG

May 29, 2025

Evaluating Prompt Engineering Techniques for Accuracy and Confidence Elicitation in Medical LLMs

Nariman Naderi, Zahra Atf, Peter R Lewis, Aref Mahjoubfar, Seyed Amir Ahmad Safavi-Naini, Ali Soroush

arXiv:2506.00072v1

2.32 citations

h-index: 9

EXTRAAMAS

Originality Synthesis-oriented

AI Analysis

This paper addresses the problem of unreliable confidence estimates from LLMs in high-stakes medical decision-making, but its contribution is incremental because it evaluates existing methods on new data.

This paper investigated how prompt engineering techniques affect accuracy and confidence in medical LLMs, finding that Chain-of-Thought prompts improved accuracy but caused overconfidence, with smaller models like Llama-3.1-8b underperforming across all metrics.

This paper investigates how prompt engineering techniques impact both accuracy and confidence elicitation in Large Language Models (LLMs) applied to medical contexts. Using a stratified dataset of Persian board exam questions across multiple specialties, we evaluated five LLMs - GPT-4o, o3-mini, Llama-3.3-70b, Llama-3.1-8b, and DeepSeek-v3 - across 156 configurations. These configurations varied in temperature settings (0.3, 0.7, 1.0), prompt styles (Chain-of-Thought, Few-Shot, Emotional, Expert Mimicry), and confidence scales (1-10, 1-100). We used AUC-ROC, Brier Score, and Expected Calibration Error (ECE) to evaluate alignment between confidence and actual performance. Chain-of-Thought prompts improved accuracy but also led to overconfidence, highlighting the need for calibration. Emotional prompting further inflated confidence, risking poor decisions. Smaller models like Llama-3.1-8b underperformed across all metrics, while proprietary models showed higher accuracy but still lacked calibrated confidence. These results suggest prompt engineering must address both accuracy and uncertainty to be effective in high-stakes medical tasks.
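The abstract names Brier Score and Expected Calibration Error (ECE) but does not give formulas. As a minimal sketch, assuming the standard definitions of both metrics, the following shows how self-reported confidences on a 1-10 or 1-100 scale could be compared against correctness. The function names, the linear rescaling by the scale maximum, and the choice of 10 equal-width bins are illustrative assumptions, not the authors' implementation; AUC-ROC is omitted for brevity.

```python
from typing import List, Sequence, Tuple


def normalize(confidence: float, scale_max: int) -> float:
    """Map a confidence reported on a 1..scale_max scale into [0, 1].

    Assumption: simple division by the scale maximum (10 or 100).
    """
    return confidence / scale_max


def brier_score(probs: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between confidence and correctness (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, correct)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data (made up for illustration): four answers rated on a 1-10 scale.
raw_confidences = [9, 9, 8, 3]
is_correct = [1, 0, 1, 0]
probabilities = [normalize(c, 10) for c in raw_confidences]

print("Brier score:", brier_score(probabilities, is_correct))
print("ECE:", expected_calibration_error(probabilities, is_correct))
```

For both metrics lower is better. An overconfident model, such as the Chain-of-Thought and emotional-prompt configurations described above, would show mean confidence above accuracy in the high-confidence bins, which raises ECE.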
