Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge
This work addresses the problem of assessing whether LLMs are applicable in specialized domains such as TCM, a question relevant to both researchers and practitioners. The contribution is incremental, however, because it applies existing methods to new data.
The paper addresses the lack of research on Large Language Models (LLMs) in Traditional Chinese Medicine (TCM) by creating the TCM-QA dataset and evaluating ChatGPT on it. ChatGPT performed best on true or false questions, with a highest precision of 0.688, and worst on multiple-choice questions, with a lowest precision of 0.241. Chinese prompts outperformed English prompts.
No previous work has studied the performance of Large Language Models (LLMs) in the context of Traditional Chinese Medicine (TCM), an essential and distinct branch of medical knowledge with a rich history. To bridge this gap, we present a TCM question dataset named TCM-QA, which comprises three question types (single choice, multiple choice, and true or false) to examine the LLM's capacity for knowledge recall and comprehensive reasoning within the TCM domain. We evaluate the LLM in two settings, zero-shot and few-shot, and compare English and Chinese prompts. Our results indicate that ChatGPT performs best on true or false questions, where it achieves its highest precision of 0.688, and worst on multiple-choice questions, where it records its lowest precision of 0.241. Furthermore, Chinese prompts outperformed English prompts in our evaluations. We also assess the quality of the explanations generated by ChatGPT and their potential contribution to TCM knowledge comprehension. This paper offers insights into the applicability of LLMs in specialized domains and paves the way for future research on leveraging these models to advance TCM.
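To make the per-question-type comparison concrete, the following is a minimal sketch of how answers to the three question types could be scored. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not specify how precision is computed, so the sketch assumes exact-match scoring, in which a multiple-choice answer counts as correct only when the predicted option set equals the gold option set. The function names, the record layout, and the toy records are all hypothetical and are not drawn from TCM-QA.

```python
from collections import defaultdict


def normalize(answer: str) -> frozenset:
    """Reduce an answer such as 'A, C' or 'true' to a comparable set of tokens."""
    return frozenset(answer.upper().replace(",", " ").split())


def score_by_type(records):
    """Return the fraction of exactly matching answers per question type.

    records: iterable of (question_type, gold_answer, predicted_answer).
    Assumption: exact set match; partial credit is not awarded.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, gold, pred in records:
        total[qtype] += 1
        if normalize(gold) == normalize(pred):
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Hypothetical toy records for illustration only (not taken from TCM-QA).
records = [
    ("single_choice", "B", "B"),
    ("single_choice", "C", "A"),
    ("multiple_choice", "A, C", "A C"),      # same option set, counted correct
    ("multiple_choice", "A, B, D", "A, B"),  # missing option, counted wrong
    ("true_false", "True", "true"),
    ("true_false", "False", "True"),
]
scores = score_by_type(records)
```

Under exact-match scoring, a multiple-choice question is answered correctly only if every option is right, which is one plausible reason such questions would score lower than true or false questions. The same function could be run separately on zero-shot and few-shot outputs, or on English and Chinese prompt outputs, to compare settings.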