CL · Feb 18, 2025

Baichuan-M1: Pushing the Medical Capability of Large Language Models

arXiv:2502.12671v2 · 38 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient and practical medical LLMs, though the advance is incremental: it applies existing LLM paradigms to a specific domain.

The authors tackle the scarcity of domain-specific large language models in medicine by introducing Baichuan-M1, a series of models trained from scratch on 20 trillion tokens, which they report excels in medical applications while maintaining strong general capabilities.

The current generation of large language models (LLMs) is typically designed for broad, general-purpose applications, while domain-specific LLMs, especially in vertical fields like medicine, remain relatively scarce. In particular, the development of highly efficient and practical LLMs for the medical domain is challenging due to the complexity of medical knowledge and the limited availability of high-quality data. To bridge this gap, we introduce Baichuan-M1, a series of large language models specifically optimized for medical applications. Unlike traditional approaches that simply continue pretraining on existing models or apply post-training to a general base model, Baichuan-M1 is trained from scratch with a dedicated focus on enhancing medical capabilities. Our model is trained on 20 trillion tokens and incorporates a range of effective training methods that strike a balance between general capabilities and medical expertise. As a result, Baichuan-M1 not only performs strongly across general domains such as mathematics and coding but also excels in specialized medical fields. We have open-sourced Baichuan-M1-14B, a mini version of our model, which can be accessed through the following links.
