CL AI LG · Oct 20, 2025

Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

Jiacheng Xie, Shuai Zeng, Yang Yu, Xiaoting Tang, Guanghui An, Dong Xu

arXiv:2510.17402v1 · 4.91 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the problem of aligning LLMs with expert-level reasoning in traditional medical domains, with the aim of developing trustworthy TCM AI systems. The advance is incremental, because it applies existing reinforcement learning methods to a specific domain.

The study addresses the challenge of applying large language models (LLMs) to Traditional Chinese Medicine (TCM) by introducing Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO). The authors report that Ladder-base outperforms both state-of-the-art general-purpose models and domain-specific TCM models across multiple reasoning metrics.

Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
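The "intra-group comparisons" mentioned in the abstract can be illustrated with a minimal sketch of the group-relative advantage commonly used in GRPO: several responses are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. The abstract does not describe the reward function or hyperparameters used for Ladder-base, so the function name, the binary correctness rewards, and the `eps` value below are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scores for several responses sampled for ONE prompt.
    eps:     small constant (an assumed value) that avoids division
             by zero when every response receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one score")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four sampled answers to one multiple-choice
# question, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every answer receives the same score.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch, responses that score above their group's average receive a positive advantage and those below it a negative one, so no separate value model is needed to provide a baseline. When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.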
