CL, Apr 24, 2018

Data-driven Summarization of Scientific Articles

Nikola I. Nikolov, Michael Pfeiffer, Richard H. R. Hahnloser

arXiv:1804.08875v13.844 citations, Has Code

Originality: Synthesis-oriented

AI Analysis

This work provides a new domain-specific benchmark for researchers in natural language processing, though it is incremental as it adapts existing methods to a new data source.

The authors address multi-sentence text summarization by proposing scientific articles as a new benchmark domain. They generate two novel datasets and test a range of existing extractive and abstractive neural approaches. The results indicate that scientific papers are suitable for data-driven summarization and could serve as benchmarks for scaling sequence-to-sequence models to very long sequences.

Data-driven approaches to sequence-to-sequence modelling have been successfully applied to short text summarization of news articles. Such models are typically trained on input-summary pairs consisting of only a single or a few sentences, partially due to limited availability of multi-sentence training data. Here, we propose to use scientific articles as a new milestone for text summarization: large-scale training data come almost for free with two types of high-quality summaries at different levels - the title and the abstract. We generate two novel multi-sentence summarization datasets from scientific articles and test the suitability of a wide range of existing extractive and abstractive neural network-based summarization approaches. Our analysis demonstrates that scientific papers are suitable for data-driven text summarization. Our results could serve as valuable benchmarks for scaling sequence-to-sequence models to very long sequences.
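The idea that training data "come almost for free" can be illustrated with a minimal sketch of how input-summary pairs might be derived from article records. This is not the authors' pipeline: the `Article` record, the `build_pairs` helper, and the specific pairing (abstract as input for the title, body as input for the abstract) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (input text, target summary)


@dataclass
class Article:
    # Hypothetical record layout; a real corpus would need its own parser.
    title: str
    abstract: str
    body: str


def build_pairs(articles: Iterable[Article]) -> Tuple[List[Pair], List[Pair]]:
    """Build two summarization datasets from the same articles.

    Assumed pairing (not confirmed by the abstract):
      - title-level:    abstract -> title
      - abstract-level: body     -> abstract
    Articles missing a required field are skipped for that dataset.
    """
    title_pairs: List[Pair] = []
    abstract_pairs: List[Pair] = []
    for art in articles:
        title = art.title.strip()
        abstract = art.abstract.strip()
        body = art.body.strip()
        if abstract and title:
            title_pairs.append((abstract, title))
        if body and abstract:
            abstract_pairs.append((body, abstract))
    return title_pairs, abstract_pairs
```

Each article can contribute one example to each dataset without any manual annotation, which is the sense in which the summaries are nearly free.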
