CL · Aug 10, 2020

A Large-Scale Chinese Short-Text Conversation Dataset

arXiv:2008.03946v2 · 146 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

The dataset is a valuable resource for researchers in short-text conversation modeling, though the contribution is incremental: it focuses on data curation rather than novel methods.

The authors address the lack of large-scale, high-quality dialogue corpora for training neural dialogue generation models by presenting LCCC, a cleaned Chinese conversation dataset with a base version (6.8 million dialogues) and a large version (12.0 million dialogues). They also release pre-training dialogue models trained on each version to facilitate research.

Advances in neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-training dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-training models will facilitate research on short-text conversation modeling. All the models and datasets are available at https://github.com/thu-coai/CDial-GPT.
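The abstract describes a two-stage cleaning pipeline: hand-written rules followed by a learned classifier. The following is a minimal sketch of that general shape, not the authors' implementation. The specific rules (URL filter, repeated-character filter, length cap), the threshold, and every name here (`passes_rules`, `clean`, `toy_score`) are assumptions made for illustration; the abstract does not list the actual rules or describe the classifier.

```python
import re
from typing import Callable, List

Dialogue = List[str]  # one dialogue is a list of utterances, in order

# Hypothetical rules; the real rule set is not given in the abstract.
URL_RE = re.compile(r"https?://\S+")
REPEAT_RE = re.compile(r"(.)\1{5,}")  # same character six or more times in a row


def passes_rules(dialogue: Dialogue, max_len: int = 50) -> bool:
    """Stage 1: cheap rule-based checks applied to every utterance."""
    if len(dialogue) < 2:  # a conversation needs at least a post and a reply
        return False
    for utterance in dialogue:
        text = utterance.strip()
        if not text or len(text) > max_len:
            return False
        if URL_RE.search(text) or REPEAT_RE.search(text):
            return False
    return True


def clean(
    dialogues: List[Dialogue],
    score_fn: Callable[[Dialogue], float],
    threshold: float = 0.5,
) -> List[Dialogue]:
    """Stage 2: keep rule-passing dialogues whose classifier score is high enough.

    `score_fn` stands in for a trained quality classifier that returns a
    score in [0, 1]; any callable with that shape can be plugged in.
    """
    kept = []
    for dialogue in dialogues:
        if not passes_rules(dialogue):
            continue
        if score_fn(dialogue) < threshold:
            continue
        kept.append(dialogue)
    return kept


def toy_score(dialogue: Dialogue) -> float:
    """Placeholder scorer: rejects replies that merely echo the first utterance."""
    return 0.0 if dialogue[0] == dialogue[-1] else 1.0
```

Running the rules first keeps the more expensive classifier from scoring dialogues that are trivially bad; in practice `toy_score` would be replaced by a model trained on annotated dialogue pairs.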

Code Implementations · 2 repos
