CL AI | Apr 25, 2024

Large Language Models in the Clinic: A Comprehensive Benchmark

Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag

arXiv:2405.00716v48.227 citations | h-index: 22 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better benchmarks for assessing LLMs in healthcare, a need that matters to both clinicians and researchers. Its contribution is incremental, since it builds on existing datasets and tasks.

The authors address the problem of evaluating large language models (LLMs) in clinical settings by creating ClinicBench, a comprehensive benchmark that combines existing and novel datasets for tasks such as open-ended decision-making and long document processing. They find that LLM performance varies across these tasks, and they invite medical experts to assess the clinical usefulness of the models.

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.

