CL AI HC · Nov 8, 2024

Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators

Nicholas Wan, Qiao Jin, Joey Chan, Guangzhi Xiong, Serina Applebaum, Aidan Gilson, Reid McMurry, R. Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu

arXiv:2411.05897v21.01 citations · h-index: 2 · Has Code · AMIA ... Annual Symposium proceedings. AMIA Symposium

Originality: Synthesis-oriented

AI Analysis

This study addresses the need for reliable AI tools in healthcare decision-making, but its contribution is incremental: it shows that LLMs are not yet superior to humans in this specific task.

The study assessed large language models (LLMs) for clinical decision support, specifically their ability to select the appropriate medical calculator. On a 100-question subset, the highest-performing LLM achieved 66.0% accuracy, while two human annotators averaged 79.5% accuracy.

Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting medical calculators, remains uncertain. We assessed nine LLMs, including open-source, proprietary, and domain-specific models, with 1,009 multiple-choice question-answer pairs across 35 clinical calculators and compared LLMs to humans on a subset of questions. While the highest-performing LLM, OpenAI o1, provided an answer accuracy of 66.0% (CI: 56.7-75.3%) on the subset of 100 questions, two human annotators nominally outperformed LLMs with an average answer accuracy of 79.5% (CI: 73.5-85.0%). Ultimately, we evaluated medical trainees and LLMs in recommending medical calculators across clinical scenarios like risk stratification and diagnosis. With error analysis showing that the highest-performing LLMs continue to make mistakes in comprehension (49.3% of errors) and calculator knowledge (7.1% of errors), our findings highlight that LLMs are not superior to humans in calculator recommendation.
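The abstract reports accuracies with confidence intervals but does not say how the intervals were computed. As a rough sanity check, a normal-approximation (Wald) interval for 66 correct answers out of 100 reproduces the reported 56.7-75.3% range for OpenAI o1. The sketch below assumes that method and a 95% level; the function name `wald_interval` is illustrative and not taken from the paper or its code. The human interval may have been obtained differently (for example, by bootstrapping), so it is not checked here.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion.

    Assumption: the paper's interval method is not stated; this is one
    common choice that happens to match the reported o1 figures.
    """
    p = successes / n
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# 66.0% accuracy on the 100-question subset means 66 correct answers.
low, high = wald_interval(66, 100)
print(f"{low * 100:.1f}% to {high * 100:.1f}%")
```

With only 100 questions the interval is roughly 19 percentage points wide, which is worth keeping in mind when comparing the LLM and human figures.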
