ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
ParaNMT-50M provides a large-scale resource for paraphrase generation and a source of semantic knowledge for improving natural language understanding tasks. The contribution is incremental, however, since it builds on prior work that used machine translation for paraphrasing.
The authors address the problem of learning high-quality paraphrastic sentence embeddings by introducing ParaNMT-50M, a dataset of more than 50 million English-English paraphrase pairs generated via neural machine translation. They demonstrate its utility by training embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition.
We describe ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-English side of a large parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M can be a valuable resource for paraphrase generation and can provide a rich source of semantic knowledge to improve downstream natural language understanding tasks. To show its utility, we use ParaNMT-50M to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, in addition to showing how it can be used for paraphrase generation.
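The generation procedure described above (translate the non-English side of a parallel corpus into English, then pair the output with the original English reference) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `build_paraphrase_pairs`, the `translate_to_english` callable, and the exact-copy filter are hypothetical names and choices introduced here, and the dictionary lookup merely stands in for a real neural machine translation system.

```python
from typing import Callable, Iterable, List, Tuple


def build_paraphrase_pairs(
    parallel_corpus: Iterable[Tuple[str, str]],
    translate_to_english: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each English reference with the back-translation of its
    non-English counterpart (hypothetical helper, for illustration only)."""
    pairs = []
    for english_reference, foreign_sentence in parallel_corpus:
        back_translation = translate_to_english(foreign_sentence)
        # Assumed filter, not taken from the paper: drop empty outputs and
        # exact copies of the reference, which add no paraphrase signal.
        same = back_translation.strip().lower() == english_reference.strip().lower()
        if back_translation and not same:
            pairs.append((english_reference, back_translation))
    return pairs


# Toy stand-in for an NMT system: a lookup table of canned translations.
toy_translations = {
    "Ahoj, svete.": "Hi, world.",
    "Dekuji.": "Thank you.",
}
toy_corpus = [
    ("Hello, world.", "Ahoj, svete."),
    ("Thank you.", "Dekuji."),
]

pairs = build_paraphrase_pairs(toy_corpus, lambda s: toy_translations.get(s, ""))
for reference, paraphrase in pairs:
    print(f"{reference} ||| {paraphrase}")
```

In this toy run the second pair is dropped because the back-translation is identical to the reference. A real pipeline would replace the lookup with a trained translation model and would likely need stronger quality filtering than this sketch shows.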