CL, LG · Jun 21, 2021

How well do you know your summarization datasets?

arXiv:2106.11388v1715 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses unreliable evaluation in summarization by documenting the limitations of widely used datasets, which is relevant to both researchers and practitioners. Its contribution is incremental: it analyzes existing datasets, models, and metrics rather than proposing new methods.

The study manually analyzed 600 samples from three popular summarization datasets to assess data quality and summarization complexity. It finds that the datasets have distinct quality and complexity distributions, that model performance and metric reliability depend on sample complexity, and that faithful summaries often receive low scores because the references lack diversity.

State-of-the-art summarization systems are trained and evaluated on massive datasets scraped from the web. Despite their prevalence, we know very little about the underlying characteristics (data noise, summarization complexity, etc.) of these datasets, and how these affect system performance and the reliability of automatic metrics like ROUGE. In this study, we manually analyze 600 samples from three popular summarization datasets. Our study is driven by a six-class typology which captures different noise types (missing facts, entities) and degrees of summarization difficulty (extractive, abstractive). We follow with a thorough analysis of 27 state-of-the-art summarization models and 5 popular metrics, and report our key insights: (1) Datasets have distinct data quality and complexity distributions, which can be traced back to their collection process. (2) The performance of models and the reliability of metrics depend on sample complexity. (3) Faithful summaries often receive low scores because of the poor diversity of references. We release the code, annotated data and model outputs.
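
Insight (3) can be illustrated with a toy example. The sketch below is not the paper's released code: it uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1 (lowercased whitespace tokens, no stemming), and the sentences and the helper names `rouge1_f1` and `best_rouge1_f1` are invented for illustration. It shows how a paraphrased summary can score zero against a single reference, and how adding a second, differently worded reference changes the picture.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for ROUGE-1, not the official scorer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_rouge1_f1(candidate: str, references: list[str]) -> float:
    """Score against several references and keep the best match."""
    return max(rouge1_f1(candidate, r) for r in references)


# Hypothetical toy data (assumption: invented for this example).
reference = "the city council approved the new budget on monday"
extra_reference = "lawmakers passed the spending plan for next year"
near_copy = "the city council approved the budget"
paraphrase = "lawmakers passed next year's spending plan"

single_near_copy = rouge1_f1(near_copy, reference)
single_paraphrase = rouge1_f1(paraphrase, reference)
multi_paraphrase = best_rouge1_f1(paraphrase, [reference, extra_reference])

print(f"near copy,  one reference:  {single_near_copy:.2f}")
print(f"paraphrase, one reference:  {single_paraphrase:.2f}")
print(f"paraphrase, two references: {multi_paraphrase:.2f}")
```

With one reference, the near-copy scores well while the paraphrase shares no tokens with it and scores zero, regardless of whether it is faithful to the source article. Taking the maximum over several references is one common way to handle multiple references; the abstract does not say which aggregation, if any, the authors recommend.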

Code Implementations: 1 repo