Improving Large-scale Paraphrase Acquisition and Generation
This work addresses data quality problems that affect natural language processing researchers, particularly those working on paraphrase tasks. Its contribution is incremental, however, since it builds on existing datasets and methods.
The paper tackles quality issues in Twitter-based paraphrase datasets by introducing a new corpus annotated with separate paraphrase definitions for identification and generation. It reports state-of-the-art performance of 84.2 F1 for paraphrase identification, and generation models trained on the new corpus produce more diverse and higher-quality paraphrases than models trained on other datasets.
This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsourcing (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definitions, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results demonstrate that paraphrase generation models trained on MultiPIT_Auto generate more diverse and higher-quality paraphrases than their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.