CL · Mar 30, 2024

Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks

arXiv:2404.00376v2 · 52 citations · h-index: 12 · Has Code
Originality: Highly original
AI Analysis

This work addresses privacy and security concerns in medical AI by developing open-source models that narrow the performance gap with large commercial models, offering a practical solution for healthcare applications.

The authors address the weak multi-step reasoning of small open-source language models on medical tasks by introducing Meerkat, a family of models trained on synthetic chain-of-thought data derived from medical textbooks. The models achieved state-of-the-art accuracy on medical benchmarks: Meerkat-7B passed the USMLE threshold, and Meerkat-70B outperformed GPT-4 by an average of 1.3% and correctly diagnosed 21 of 38 complex clinical cases.

While recent advancements in commercial large language models (LMs) have shown promising results in medical tasks, their closed-source nature poses significant privacy and security concerns, hindering their widespread use in the medical field. Despite efforts to create open-source models, their limited parameter counts often result in insufficient multi-step reasoning capabilities for solving complex medical problems. To address this, we introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters. The models were trained on our new synthetic dataset of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. Our systems achieved remarkable accuracy across six medical benchmarks, surpassing the previous best models, such as MediTron and BioMistral, as well as GPT-3.5 by a large margin. Notably, Meerkat-7B surpassed the passing threshold of the United States Medical Licensing Examination (USMLE) for the first time for a 7B-parameter model, while Meerkat-70B outperformed GPT-4 by an average of 1.3%. Additionally, Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming humans' 13.8 and closely matching GPT-4's 21.8. Our systems offered more detailed free-form responses to clinical queries than existing small models, approaching the performance level of large commercial models. This significantly narrows the performance gap with large LMs, showcasing the models' effectiveness in addressing complex medical challenges.

