CL IR LG · Oct 14, 2020

Re-evaluating Evaluation in Text Summarization

Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig

arXiv:2010.07100

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical issue for researchers and practitioners in NLP by highlighting the unreliability of standard metrics, though it is incremental as it re-evaluates existing methods rather than proposing new ones.

The paper re-examines how reliable automatic evaluation metrics such as ROUGE are for text summarization, and finds that conclusions drawn about these metrics on older datasets do not necessarily hold on modern datasets and systems.

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
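As a rough illustration of the two evaluation settings named in the abstract, the sketch below computes a system-level and a summary-level correlation between an automatic metric and human judgments. It follows the common reading of those terms (correlate per-system averages across systems, versus correlate within each document and then average), which is an assumption here rather than the authors' code. The function names, the choice of Pearson correlation, and the toy score matrices are all hypothetical and are not results from the paper.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_level(metric, human):
    """Average each system's scores over documents, then correlate across systems.

    metric[d][s] and human[d][s] hold the score of system s on document d.
    """
    n_systems = len(metric[0])
    metric_avg = [mean(row[s] for row in metric) for s in range(n_systems)]
    human_avg = [mean(row[s] for row in human) for s in range(n_systems)]
    return pearson(metric_avg, human_avg)


def summary_level(metric, human):
    """Correlate across systems within each document, then average over documents."""
    return mean(pearson(m_row, h_row) for m_row, h_row in zip(metric, human))


# Toy data (made up for illustration): 3 documents x 3 systems.
metric = [
    [0.30, 0.40, 0.50],
    [0.20, 0.50, 0.40],
    [0.40, 0.30, 0.60],
]
human = [
    [0.2, 0.5, 0.6],
    [0.3, 0.6, 0.4],
    [0.5, 0.4, 0.7],
]

print("system-level :", round(system_level(metric, human), 3))
print("summary-level:", round(summary_level(metric, human), 3))
```

The two numbers can differ substantially on the same data, which is why a metric's reliability has to be reported separately for each setting. A rank correlation such as Kendall's tau could be substituted for `pearson` without changing the rest of the sketch.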
