CL · Apr 4, 2020

Pre-training for Abstractive Document Summarization by Reinstating Source Text

Yanyan Zou, Xingxing Zhang, Wei Lu, Furu Wei, Ming Zhou

arXiv:2004.01853v431.21007 citations · h-index: 82

Originality: Incremental advance

AI Analysis

This work targets data efficiency in pre-training for abstractive summarization, offering NLP researchers and practitioners a less resource-intensive pre-training method. The advance is incremental, since it builds on existing pre-training paradigms.

The paper tackles the challenge of training large sequence-to-sequence models for abstractive document summarization when supervised data is limited. It introduces three pre-training objectives that reinstate the original document from an artificially constructed input. With only 19GB of pre-training text, it achieves results comparable to models pre-trained on more than 160GB of data.

Abstractive document summarization is usually modeled as a sequence-to-sequence (Seq2Seq) learning problem. Unfortunately, training large Seq2Seq based summarization models on limited supervised summarization data is challenging. This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text. The main idea is that, given an input text artificially constructed from a document, a model is pre-trained to reinstate the original document. These objectives include sentence reordering, next sentence generation, and masked document generation, which have close relations with the abstractive document summarization task. Experiments on two benchmark summarization datasets (i.e., CNN/DailyMail and New York Times) show that all three objectives can improve performance upon baselines. Compared to models pre-trained on large-scale data (more than 160GB), our method, with only 19GB text for pre-training, achieves comparable results, which demonstrates its effectiveness.
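The three objectives can be illustrated as ways of turning one unlabeled document into an (input, target) pair for a Seq2Seq model. The following is a minimal sketch, not the authors' implementation: the function names, the split position, the masked span, and the `[MASK]` token are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from typing import List, Tuple

# (model input, generation target); both are lists of sentences or tokens.
Pair = Tuple[List[str], List[str]]


def sentence_reordering(sentences: List[str], rng: random.Random) -> Pair:
    """Input: the sentences in shuffled order. Target: the original order."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled, sentences[:]


def next_sentence_generation(sentences: List[str], split: int) -> Pair:
    """Input: the first `split` sentences. Target: the remaining sentences.

    Where to split is an assumption of this sketch.
    """
    return sentences[:split], sentences[split:]


def masked_document_generation(
    tokens: List[str], start: int, length: int, mask_token: str = "[MASK]"
) -> Pair:
    """Input: the document with one contiguous token span masked.
    Target: the full original document.

    The span position, span length, and mask token are assumptions.
    """
    end = min(start + length, len(tokens))
    masked = tokens[:start] + [mask_token] * (end - start) + tokens[end:]
    return masked, tokens[:]


# Hypothetical toy document, used only to show the shape of each pair.
doc = [
    "A storm hit the coast.",
    "Thousands lost power.",
    "Crews restored it by dawn.",
]
sr_input, sr_target = sentence_reordering(doc, random.Random(0))
nsg_input, nsg_target = next_sentence_generation(doc, split=1)
mdg_input, mdg_target = masked_document_generation(
    "a storm hit the coast".split(), start=1, length=2
)
```

In each case the target is text taken from the original document, so the pairs can be built from unlabeled text alone.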

