CL, Mar 10, 2025

TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine

arXiv:2503.07041v1, 11 citations, h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work addresses how to evaluate AI models in culturally grounded medical domains such as TCM, a problem relevant to both researchers and practitioners. The contribution is incremental: it applies existing evaluation methods to a new domain.

The authors address the underexplored evaluation of large language models (LLMs) in traditional Chinese medicine (TCM) by introducing TCM-3CEval, a triaxial benchmark that assesses models on core knowledge, classical text understanding, and clinical decision-making. The results reveal performance gaps in specialized subdomains such as Meridian & Acupoint theory, and show that models with Chinese linguistic and cultural priors perform better.

Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on MedBench's TCM track, aiming to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.

