CL | IR | LG | Oct 14, 2020

Re-evaluating Evaluation in Text Summarization

arXiv:2010.07100v1 | 1040 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a critical issue for researchers and practitioners in NLP by highlighting the unreliability of standard metrics, though it is incremental as it re-evaluates existing methods rather than proposing new ones.

The paper re-examines automatic evaluation metrics for text summarization and finds that conclusions about metrics such as ROUGE, drawn on older datasets, do not necessarily hold on modern datasets and systems.

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
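The abstract distinguishes system-level from summary-level evaluation settings. As a rough illustration of that distinction, the sketch below computes both kinds of correlation between an automatic metric and human judgments. It assumes a common formulation (system-level: average each system's scores over documents, then correlate across systems; summary-level: correlate across systems within each document, then average over documents). The score matrices are made-up toy numbers, not data from the paper, and the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def system_level(metric, human):
    """Average scores per system over documents, then correlate across systems.

    metric[d][s] and human[d][s] hold the score of system s on document d.
    """
    n_docs, n_sys = len(metric), len(metric[0])
    metric_means = [sum(metric[d][s] for d in range(n_docs)) / n_docs
                    for s in range(n_sys)]
    human_means = [sum(human[d][s] for d in range(n_docs)) / n_docs
                   for s in range(n_sys)]
    return pearson(metric_means, human_means)

def summary_level(metric, human):
    """Correlate across systems within each document, then average over documents."""
    per_doc = [pearson(m_row, h_row) for m_row, h_row in zip(metric, human)]
    return sum(per_doc) / len(per_doc)

# Toy data (assumption): 3 documents (rows) x 3 systems (columns).
metric_scores = [
    [0.30, 0.40, 0.50],
    [0.50, 0.40, 0.60],
    [0.40, 0.60, 0.55],
]
human_scores = [
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 3.0, 5.0],
]

sys_corr = system_level(metric_scores, human_scores)
summ_corr = summary_level(metric_scores, human_scores)
print(f"system-level Pearson:  {sys_corr:.3f}")
print(f"summary-level Pearson: {summ_corr:.3f}")
```

On this toy data the system-level correlation comes out higher than the summary-level one, which shows how a metric can rank systems well on average while agreeing less with human judgments on individual summaries. Pearson is used here only for brevity; a rank correlation such as Kendall's tau could be substituted in the same structure.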

Code Implementations: 1 repo
