CL IR LG · Oct 18, 2018

WikiHow: A Large Scale Text Summarization Dataset

arXiv:1810.09305v1 15.8342 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper provides a new dataset for text summarization researchers, addressing the limitation that existing datasets are mostly news-focused. The contribution is incremental in that it focuses on data creation rather than method innovation.

The authors address the lack of large-scale, high-quality datasets for text summarization by introducing WikiHow, a dataset of more than 230,000 article-summary pairs drawn from an online knowledge base and spanning diverse topics and writing styles. They evaluate existing methods on this dataset to establish baselines and highlight its challenges.

Sequence-to-sequence models have recently achieved state-of-the-art performance in summarization. However, few large-scale, high-quality datasets are available, and almost all of them consist mainly of news articles with a specific writing style. Moreover, abstractive, human-style systems that describe content at a deeper level require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent a high diversity of styles. We evaluate the performance of existing methods on WikiHow to present its challenges and to set baselines for further improvement.
