CL · AI · May 2, 2023

Huatuo-26M, a Large-scale Chinese Medical QA Dataset

arXiv:2305.01526v1 · 78 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This dataset addresses the need for large-scale medical QA resources in Chinese, benefiting medical research, patients, and clinicians, though it is incremental as it builds on existing QA dataset efforts.

The authors introduced Huatuo-26M, a large-scale Chinese medical QA dataset with 26 million QA pairs, and found that existing models perform poorly on it, highlighting its challenge and utility for tasks like zero-shot learning and retrieval-augmented generation.

In this paper, we release the largest medical Question Answering (QA) dataset to date, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. Experimental results show that existing models perform far below expectations and that the released dataset remains challenging in the era of pre-trained language models. Moreover, we experimentally show the benefit of the proposed dataset in several respects: (i) training models that transfer to other QA datasets in a zero-shot fashion; (ii) serving as external knowledge for retrieval-augmented generation (RAG); and (iii) improving existing pre-trained language models by using the QA pairs as a pre-training corpus for continued training. We believe that this dataset will not only contribute to medical research but also benefit both patients and clinicians. See https://github.com/FreedomIntelligence/Huatuo-26M.
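To make use (ii) in the abstract concrete, the following is a minimal sketch of how QA pairs could act as external knowledge for retrieval-augmented generation. It is not the authors' pipeline: the toy `QA_PAIRS`, the token-overlap (Jaccard) scorer, and the helper names `retrieve` and `build_prompt` are all assumptions made for illustration, and the entries are placeholders rather than records from Huatuo-26M. A real system would presumably use a sparse or dense retriever over the full dataset.

```python
from typing import List, Set, Tuple

QAPair = Tuple[str, str]

# Hypothetical placeholder entries; NOT taken from Huatuo-26M.
QA_PAIRS: List[QAPair] = [
    ("what causes a persistent dry cough", "(placeholder answer about cough)"),
    ("how to treat a mild fever", "(placeholder answer about fever)"),
    ("is headache a symptom of dehydration", "(placeholder answer about headache)"),
]


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap: |a & b| / |a | b|, or 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query: str, pairs: List[QAPair], k: int = 2) -> List[QAPair]:
    """Return the k QA pairs whose questions overlap most with the query."""
    q_tokens = tokenize(query)
    ranked = sorted(
        pairs,
        key=lambda pair: jaccard(q_tokens, tokenize(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, retrieved: List[QAPair]) -> str:
    """Prepend the retrieved QA pairs as context for a generator model."""
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in retrieved)
    return f"Reference QA pairs:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    user_query = "how to treat a mild fever at home"
    print(build_prompt(user_query, retrieve(user_query, QA_PAIRS, k=1)))
```

The resulting prompt would then be passed to whichever generator model is being evaluated; the retrieval step is deliberately kept separate so that the scorer can be swapped without touching the prompt format.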

Code Implementations: 1 repo
