CL, AI | May 2, 2023

Huatuo-26M, a Large-scale Chinese Medical QA Dataset

Jianquan Li, Xidong Wang, Xiangbo Wu, Zhiyi Zhang, Xiaolong Xu, Jie Fu, Prayag Tiwari, Xiang Wan, Benyou Wang

arXiv:2305.01526v1 | 14.078 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This dataset addresses the need for large-scale medical QA resources in Chinese, benefiting medical research, patients, and clinicians, though it is incremental as it builds on existing QA dataset efforts.

The authors introduce Huatuo-26M, a large-scale Chinese medical QA dataset with 26 million QA pairs, and find that existing models perform poorly on it, which shows how challenging the dataset remains. They also demonstrate its utility for tasks such as zero-shot transfer and retrieval-augmented generation.

In this paper, we release the largest medical Question Answering (QA) dataset to date, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. Experimental results show that existing models perform far below expectations and that the released dataset remains challenging in the era of pre-trained language models. Moreover, we experimentally show the benefit of the proposed dataset in several respects: (i) models trained on it transfer to other QA datasets in a zero-shot fashion; (ii) it serves as external knowledge for retrieval-augmented generation (RAG); and (iii) it improves existing pre-trained language models when the QA pairs are used as a pre-training corpus for continued training. We believe that this dataset will not only contribute to medical research but also benefit both patients and clinicians. See https://github.com/FreedomIntelligence/Huatuo-26M.
