CL
May 20, 2020

A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

arXiv:2005.10070v1
AI Analysis

This paper addresses the shortage of training data for supervised multi-document summarization models. The contribution is incremental, as it builds on existing resources.

The authors address the lack of large-scale datasets for multi-document summarization by building a new dataset from the Wikipedia Current Events Portal. The dataset is large both in its number of document clusters and in the size of individual clusters, and the authors report empirical results for several state-of-the-art techniques.

Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
