CL Aug 22, 2024

Towards Evaluating and Building Versatile Large Language Models for Medicine

Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, Weidi Xie

arXiv:2408.12547v216.869 citations h-index: 20 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of adapting general LLMs to complex clinical tasks for medical professionals and researchers, offering a new benchmark and dataset to advance the field, though it is incremental in building on existing tuning methods.

The study addresses the evaluation and improvement of large language models (LLMs) for medical applications. It introduces MedS-Bench, a benchmark covering 11 clinical tasks, and finds that even top models such as GPT-4 struggle on it. The authors then build MedS-Ins, an instruction tuning dataset of 13.5 million samples; a model fine-tuned on it, MMedIns-Llama 3, significantly outperforms existing models across nearly all tasks.

In this study, we present MedS-Bench, a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. Unlike existing benchmarks that focus on multiple-choice question answering, MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation, among others. We evaluated six leading LLMs, namely MEDITRON, Mistral, InternLM 2, Llama 3, GPT-4, and Claude-3.5, using few-shot prompting, and found that even the most sophisticated models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment by performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models across nearly all clinical tasks. To promote further advancements in the application of LLMs to clinical challenges, we have made the MedS-Ins dataset fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench, and we plan to regularly update its test set to track progress and enhance the adaptation of general LLMs to the medical domain. Leaderboard: https://henrychur.github.io/MedS-Bench/. Github: https://github.com/MAGIC-AI4Med/MedS-Ins.
