CL | Jul 30, 2019

Abstractive Document Summarization without Parallel Data

Nikola I. Nikolov, Richard H. R. Hahnloser

arXiv:1907.12951v230.11001 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses data scarcity in summarization and offers a practical approach for domains with limited parallel data. It is an incremental advance, since it builds on existing unsupervised and synthetic-data training methods.

The paper tackles abstractive document summarization when parallel article-summary pairs are scarce. It develops a system that uses only example summaries and non-matching articles, and reports promising performance on the CNN/DailyMail benchmark and on a novel task, generating press releases from scientific journal articles, without relying on paired data.

Abstractive summarization typically relies on large collections of paired articles and summaries. However, in many cases, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor, which selects salient sentences to include in the final summary, and a sentence abstractor, trained on pseudo-parallel and synthetic data, which paraphrases each of the extracted sentences. We perform an extensive evaluation of our method on two tasks: the CNN/DailyMail benchmark, on which we compare our approach to fully supervised baselines, and the novel task of automatically generating a press release from a scientific journal article, which is well suited for our system. We show promising performance on both tasks, without relying on any article-summary pairs.
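
The extract-then-abstract structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how salience is scored, so the word-frequency score below is a stand-in assumption, and the function names (split_sentences, extract_salient, summarize) are hypothetical. The abstractor is passed in as any sentence-to-sentence callable, standing in for a paraphrasing model trained on pseudo-parallel and synthetic data.

```python
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    # Lowercased alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def extract_salient(sentences: List[str], k: int) -> List[str]:
    # ASSUMPTION: salience = mean document-level frequency of a sentence's
    # words. This is a simple unsupervised stand-in, not the paper's extractor.
    counts = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence: str) -> float:
        words = tokenize(sentence)
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top-k sentences, restored to their original document order.
    return [sentences[i] for i in sorted(ranked[:k])]


def summarize(article: str, k: int, abstractor: Callable[[str], str]) -> str:
    # Stage 1: select salient sentences. Stage 2: rewrite each one independently.
    selected = extract_salient(split_sentences(article), k)
    return " ".join(abstractor(s) for s in selected)
```

Calling summarize(article, 3, lambda s: s) with the identity function as the abstractor yields a purely extractive summary; swapping in a trained paraphrasing model would give the abstractive variant. Because each sentence is rewritten independently, the abstractor needs only sentence-level training pairs, not article-summary pairs.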

