CL | Feb 15, 2024

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

arXiv:2402.09742v4 | 109 citations | h-index: 23 | Has Code | COLING
Originality: Incremental advance
AI Analysis

This work addresses the challenge of assessing LLMs for real-world doctor-patient interactions in healthcare, though it is incremental as it builds on existing simulation and benchmarking methods.

The paper tackles the problem of evaluating large language models (LLMs) in realistic clinical settings by introducing AI Hospital, a multi-agent simulator with an accompanying benchmark. It finds that LLMs perform significantly worse in multi-turn interactions than in one-step approaches.

Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework simulating dynamic medical interactions between a Doctor as the player and NPCs including a Patient, an Examiner, and a Chief Physician. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.
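To make the multi-turn setup in the abstract concrete, here is a minimal Python sketch of a Doctor agent gathering information from Patient and Examiner NPCs before diagnosing. It is an illustration only, not the authors' implementation: every class, method, and record field is hypothetical, the agents are scripted rules where the paper uses LLMs, and the Chief Physician is assumed here to act as a simple grader against a reference diagnosis.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    """Scripted NPC (hypothetical): reveals a symptom only when asked about it."""
    symptoms: dict  # topic keyword -> the patient's answer

    def reply(self, question: str) -> str:
        for topic, answer in self.symptoms.items():
            if topic in question.lower():
                return answer
        return "I'm not sure."


@dataclass
class Examiner:
    """Scripted NPC (hypothetical): returns a result only for a requested test."""
    results: dict  # test name -> result text

    def run(self, test: str) -> str:
        return self.results.get(test, "No abnormality found.")


@dataclass
class Doctor:
    """Stand-in for the LLM under evaluation; here it follows a fixed script."""
    questions: list
    tests: list
    notes: list = field(default_factory=list)

    def diagnose(self) -> str:
        # Toy rule; a real Doctor agent would be an LLM call over the notes.
        text = " ".join(self.notes).lower()
        if "fever" in text and "infiltrate" in text:
            return "pneumonia"
        return "unknown"


def consult(doctor: Doctor, patient: Patient, examiner: Examiner) -> str:
    """Multi-turn loop: collect symptoms, request examinations, then diagnose."""
    for question in doctor.questions:
        doctor.notes.append(patient.reply(question))
    for test in doctor.tests:
        doctor.notes.append(examiner.run(test))
    return doctor.diagnose()


def chief_physician_score(diagnosis: str, reference: str) -> float:
    """Assumed grader role: exact match against the reference diagnosis."""
    return 1.0 if diagnosis == reference else 0.0


# Made-up record for illustration; not taken from the MVME data.
patient = Patient({"fever": "I have had a fever for three days.",
                   "cough": "Yes, a dry cough."})
examiner = Examiner({"chest x-ray": "Right lower lobe infiltrate."})
doctor = Doctor(questions=["Do you have a fever?", "Any cough?"],
                tests=["chest x-ray"])

diagnosis = consult(doctor, patient, examiner)
score = chief_physician_score(diagnosis, reference="pneumonia")
```

The point of the sketch is that the Doctor only sees what it asks for: a Doctor that skips the fever question or the chest X-ray ends with `"unknown"`, which is the kind of gap between multi-turn and one-step settings that the abstract reports.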

Code Implementations: 1 repo
