Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability
This position paper addresses the problem of unreliable diagnostic uncertainty estimation in LLMs for clinical decision support. Its contribution is incremental: it critiques existing methods without proposing new solutions.
The study evaluates Mistral-7B and Llama3-70B on diagnostic tasks using EHR data, reveals limitations in current methods for extracting probability estimates from LLMs, and highlights the need for improved confidence estimation techniques.
Large language models (LLMs) are being explored for diagnostic decision support, yet their ability to estimate pre-test probabilities, which are vital for clinical decision-making, remains limited. This study evaluates two LLMs, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks. We examine three current methods of extracting probability estimates from LLMs and reveal their limitations. We aim to highlight the need for improved techniques in LLM confidence estimation.
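To make the term "next-word probability" in the title concrete, here is a minimal sketch of one common way to read a probability off a language model: take the logits the model assigns to two candidate answer tokens (say "yes" and "no") and normalize them with a softmax. The abstract does not name the three methods it examines, so this is an illustrative assumption, not a description of the study's procedure. The function name and the logit values are hypothetical.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax restricted to two answer tokens; returns the share assigned to 'yes'.

    The inputs are assumed to be raw next-token logits for the tokens
    'yes' and 'no'; all other vocabulary tokens are ignored.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


# Hypothetical logits, chosen only for illustration.
p_yes = yes_probability(2.0, 0.0)
print(f"token-based P(yes) = {p_yes:.3f}")
```

The number this produces describes how the model distributes weight between two tokens. It is not, by construction, a calibrated estimate of how likely a patient is to have the diagnosis, which is the distinction the title draws.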