CL · Aug 28, 2018

Learning To Split and Rephrase From Wikipedia Edit History

Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, Dipanjan Das

arXiv:1808.09468v132.41120 citations

Originality: Incremental advance

AI Analysis

This work addresses the split-and-rephrase task in natural language processing by providing a much larger training dataset, WikiSplit, that improves model performance.

The authors mined Wikipedia's edit history to build a large dataset of sentence splits. A model trained with this data scores 32 BLEU points above the prior best result on the WebSplit benchmark.

Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.
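To make the idea of mining split examples from an edit history concrete, here is a minimal sketch of one possible filter. It is an illustrative assumption, not the authors' extraction pipeline: the function name `is_candidate_split`, the whitespace tokenizer, the n-gram length, and the example sentences are all hypothetical. The heuristic accepts a pair of short sentences as a candidate split of a longer one when the long sentence begins like the first short sentence and ends like the second.

```python
def tokens(sentence):
    # Naive whitespace tokenization (an assumption for this sketch);
    # a real pipeline would use a proper tokenizer.
    return sentence.split()


def is_candidate_split(full, first, second, n=3):
    """Return True if (first, second) plausibly splits `full`.

    Hypothetical heuristic: the long sentence must share its first n
    tokens with `first` and its last n tokens with `second`.
    """
    f, a, b = tokens(full), tokens(first), tokens(second)
    if min(len(f), len(a), len(b)) < n:
        return False
    same_start = f[:n] == a[:n]
    same_end = f[-n:] == b[-n:]
    # Reject pairs whose two halves end identically (likely a duplicate).
    distinct_ends = a[-n:] != b[-n:]
    return same_start and same_end and distinct_ends


# Made-up example sentences, pre-tokenized with spaces around punctuation.
full = "The film was directed by Ann Lee and was released in 2001 ."
first = "The film was directed by Ann Lee ."
second = "It was released in 2001 ."
print(is_candidate_split(full, first, second))
```

A filter like this trades recall for precision: it misses splits that reorder or reword the opening or closing of the sentence, so a practical system would likely combine it with a similarity score between the original and the rewritten text.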
