CL | Aug 1, 2022

Multi-Document Summarization with Centroid-Based Pretraining

Ratish Puduppully, Parag Jain, Nancy F. Chen, Mark Steedman

arXiv:2208.01006v221.5226 citations | h-index: 61 | Has Code

Originality: Highly original

AI Analysis

This work addresses pretraining for Multi-Document Summarization (MDS) without labeled summaries. Because the objective needs only clusters of related documents, it could help natural language processing researchers and practitioners pretrain MDS models on corpora that lack reference summaries.

The paper tackles the problem of pretraining for Multi-Document Summarization (MDS) by introducing a novel objective that uses ROUGE-based centroids as proxy summaries, eliminating the need for human-written summaries. The model Centrum achieves results that are better or comparable to state-of-the-art models in zero-shot, few-shot, and fully supervised experiments on multiple datasets.

In Multi-Document Summarization (MDS), the input can be modeled as a set of documents, and the output is its summary. In this paper, we focus on pretraining objectives for MDS. Specifically, we introduce a novel pretraining objective, which involves selecting the ROUGE-based centroid of each document cluster as a proxy for its summary. Our objective thus does not require human-written summaries and can be utilized for pretraining on a dataset consisting solely of document sets. Through zero-shot, few-shot, and fully supervised experiments on multiple MDS datasets, we show that our model Centrum is better or comparable to a state-of-the-art model. We make the pretrained and fine-tuned models freely available to the research community at https://github.com/ratishsp/centrum.
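The centroid selection described in the abstract can be illustrated with a minimal sketch. It assumes the centroid is the document whose average ROUGE score against the other documents in its cluster is highest, and it substitutes a plain unigram-overlap F1 for a full ROUGE scorer. The function names (`rouge1_f1`, `select_centroid`, `make_pretraining_pair`) are hypothetical and are not taken from the Centrum repository; the paper's exact scoring and preprocessing may differ.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for a full ROUGE scorer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_centroid(cluster: list[str]) -> int:
    """Return the index of the document most similar, on average, to the others."""
    if len(cluster) < 2:
        raise ValueError("a cluster needs at least two documents")

    def mean_score(i: int) -> float:
        others = [doc for j, doc in enumerate(cluster) if j != i]
        return sum(rouge1_f1(cluster[i], doc) for doc in others) / len(others)

    return max(range(len(cluster)), key=mean_score)


def make_pretraining_pair(cluster: list[str]) -> tuple[list[str], str]:
    """Split a cluster into (source documents, proxy summary).

    Assumption: the centroid is held out as the target and the remaining
    documents form the input.
    """
    idx = select_centroid(cluster)
    return cluster[:idx] + cluster[idx + 1:], cluster[idx]


if __name__ == "__main__":
    cluster = [
        "storm hits coast",
        "storm hits coast and floods roads",
        "floods close roads",
    ]
    sources, target = make_pretraining_pair(cluster)
    print("proxy summary:", target)
    print("inputs:", sources)
```

In this toy cluster the second document overlaps with both of the others, so it is selected as the proxy summary and the remaining two documents become the model input. No human-written summary is involved at any point, which is the property the abstract emphasizes.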
