CL AI Sep 21, 2023

MiChao-HuaFen 1.0: A Specialized Pre-trained Corpus Dataset for Domain-specific Large Models

Yidong Liu, FuKai Shang, Fang Wang, Rui Xu, Jun Wang, Wei Li, Yao Li, Conghui He

arXiv:2309.13079v21.32 citations h-index: 60

Originality: Synthesis-oriented

AI Analysis

This paper provides a specialized dataset for Chinese vertical domains that supports research and applications, but the contribution is incremental because it builds on existing pre-training methods.

The paper tackles the need for high-quality domain-specific outputs by introducing the MiChao-HuaFen 1.0 pre-trained corpus dataset, tailored for news and governmental sectors, sourced from 2022 internet data with cleansing and updates.

With the advancement of deep learning technologies, general-purpose large models such as GPT-4 have demonstrated exceptional capabilities across various domains. Nevertheless, there remains a demand for high-quality, domain-specific outputs in areas like healthcare, law, and finance. This paper first evaluates the existing large models for specialized domains and discusses their limitations. To cater to the specific needs of certain domains, we introduce the "MiChao-HuaFen 1.0" pre-trained corpus dataset, tailored for the news and governmental sectors. The dataset, sourced from publicly available internet data from 2022, underwent multiple rounds of cleansing and processing to ensure high quality and reliable origins, with provisions for consistent and stable updates. This dataset not only supports the pre-training of large models for Chinese vertical domains but also aids in propelling deep learning research and applications in related fields.
