AI · Aug 21, 2024

Probabilistic Medical Predictions of Large Language Models

Bowen Gu, Rishi J. Desai, Kueiyu Joshua Lin, Jie Yang

Harvard

arXiv:2408.11316v216.037 citations · h-index: 23 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of unreliable probability estimates from LLMs in clinical decision-making and points to the need for improved probability estimation methods.

The study found that large language models (LLMs) produce unreliable explicit probability estimates for medical predictions, with implicit probabilities derived from token likelihoods outperforming them in discrimination, precision, and recall across six LLMs and five datasets, especially for smaller models and imbalanced data.

Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering, allowing flexible clinical predictions. However, they struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making. While explicit prompts can lead LLMs to generate probability estimates, their numerical reasoning limitations raise concerns about reliability. We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token. Across six advanced open-source LLMs and five medical datasets, explicit probabilities consistently underperformed implicit probabilities in discrimination, precision, and recall. This discrepancy is more pronounced with smaller LLMs and imbalanced datasets, highlighting the need for cautious interpretation, improved probability estimation methods, and further research for clinical use of LLMs.
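The distinction between the two kinds of probability can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the "Yes"/"No" label tokens, the normalization over only the candidate label tokens, and the number-parsing rule are all assumptions made for illustration. The implicit probability is computed from the log-probabilities the model assigns to each label token, while the explicit probability is parsed from the number the model writes out in its generated text.

```python
import math
import re


def implicit_probability(label_logprobs, positive_label="Yes"):
    """Implicit probability: renormalize the model's log-probabilities
    over the candidate label tokens (assumed here to be "Yes"/"No")."""
    max_lp = max(label_logprobs.values())  # subtract max for numerical stability
    weights = {k: math.exp(v - max_lp) for k, v in label_logprobs.items()}
    return weights[positive_label] / sum(weights.values())


def explicit_probability(generated_text):
    """Explicit probability: parse the first number the model wrote.
    Percentages are converted to [0, 1]; returns None if no usable value."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(%?)", generated_text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2) == "%":
        value /= 100.0
    return value if 0.0 <= value <= 1.0 else None


# Hypothetical log-probabilities for the two label tokens at the answer position.
p_implicit = implicit_probability({"Yes": math.log(0.6), "No": math.log(0.2)})
p_explicit = explicit_probability("The probability is 0.8.")
```

In this sketch the implicit estimate depends only on token likelihoods, so it does not rely on the model's ability to write a sensible number, which is the weakness the abstract attributes to explicit estimates.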
