CL · Mar 11, 2021

MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

arXiv:2103.06410v2 · 753 citations
Originality: Synthesis-oriented
AI Analysis

This dataset is a valuable resource for natural language processing researchers working on dialogue summarization, though its contribution is incremental because it builds on existing data collection methods.

The authors introduced MediaSum, a large-scale dataset of 463.6K media interview transcripts with abstractive summaries, collected from NPR and CNN, to address the lack of resources for dialogue summarization. They demonstrated its utility by showing it can improve model performance on other tasks through transfer learning.

We introduce MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.
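
To make the idea of positional bias concrete, here is a minimal sketch of one way such an analysis could be set up. It is not the authors' procedure: the word-overlap similarity, the function names, and the toy dialogues are all assumptions made for illustration. For each transcript it finds the turn that shares the most words with the summary and records where in the conversation that turn falls.

```python
from collections import Counter

def overlap(a: str, b: str) -> int:
    """Count word tokens shared by two strings (a crude similarity proxy)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum((ca & cb).values())

def best_turn_position(utterances, summary):
    """Relative position in [0, 1] of the turn most similar to the summary.

    Assumes `utterances` is a non-empty list of strings.
    """
    scores = [overlap(u, summary) for u in utterances]
    best = scores.index(max(scores))  # the earliest turn wins on ties
    return best / max(len(utterances) - 1, 1)

def position_histogram(dialogues, bins=4):
    """Bucket the best-matching turn positions across many dialogues."""
    counts = [0] * bins
    for utterances, summary in dialogues:
        pos = best_turn_position(utterances, summary)
        counts[min(int(pos * bins), bins - 1)] += 1
    return counts

# Hypothetical toy data, not taken from MediaSum.
toy_dialogues = [
    (["today we discuss the election results",
      "thanks for having me",
      "the weather is nice"],
     "election results discussed"),
    (["hello",
      "welcome back",
      "a new vaccine trial begins"],
     "vaccine trial begins"),
]
histogram = position_histogram(toy_dialogues)
```

A histogram skewed toward the first bucket would suggest that summary-relevant content tends to appear early in the conversation. A real analysis would likely use a stronger similarity measure such as ROUGE in place of raw word overlap.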

Code Implementations: 1 repo