CL | Sep 5, 2023

An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

arXiv:2309.02077v1 | 17 citations | h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses the need for accurate LLMs in medical applications, though its contribution is incremental because it builds on existing evaluation methods.

The paper tackles the problem of hallucinations in large language models (LLMs) during multi-turn medical consultations by introducing an automated evaluation framework, showing that fine-tuning with a constructed training set improves LLMs' performance on a benchmark derived from USMLE questions.

Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, leading to overly confident but incorrect judgments. This limits their application in the medical domain, where tasks require the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. Consultation tasks are designed to require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and to ultimately make diagnoses. To evaluate the performance of LLMs on these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examinations (USMLE), and comprehensive evaluation metrics are developed and evaluated on three constructed test sets. A medical consultation training set is further constructed to improve the consultation ability of LLMs. The experimental results show that fine-tuning with the training set can alleviate hallucinations and improve LLMs' performance on the proposed benchmark. Extensive experiments and ablation studies are conducted to validate the effectiveness and robustness of the proposed framework.

