IR, AI, CL | Mar 5, 2018

ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

arXiv:1803.01937v1 | 199 citations

Originality: Incremental advance

AI Analysis

This paper addresses inaccurate and limited automatic evaluation of summarization tasks, a problem that matters to researchers and developers in natural language processing, though the contribution appears incremental as an update to an existing standard.

The paper tackles the limitations of ROUGE measures in evaluating summarization tasks by introducing ROUGE 2.0, which includes updated measures like ROUGE-N+Synonyms and ROUGE-Topic to better capture synonymous concepts and topic coverage, resulting in improved evaluation metrics.

Evaluation of summarization tasks is crucial to determining the quality of machine-generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for summarization tasks. While ROUGE has been shown to be effective in capturing n-gram overlap between system and human-composed summaries, the existing ROUGE measures have several limitations in capturing synonymous concepts and coverage of topics. Thus, ROUGE scores often do not reflect the true quality of summaries, which prevents multi-faceted evaluation of summaries (i.e., by topic, by overall content coverage, etc.). In this paper, we introduce ROUGE 2.0, which has several updated measures of ROUGE: ROUGE-N+Synonyms, ROUGE-Topic, ROUGE-Topic+Synonyms, ROUGE-TopicUniq and ROUGE-TopicUniq+Synonyms, all of which are improvements over the core ROUGE measures.
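
To make the n-gram overlap idea and its synonym limitation concrete, the following is a minimal sketch of a recall-style ROUGE-N score with an optional synonym-normalization step. It is not the ROUGE 2.0 implementation: the function names (`rouge_n_recall`, `ngrams`, `normalize`), the whitespace tokenization, and the hand-written synonym map are all assumptions made for illustration, and the paper's own synonym resource and matching rules may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def normalize(tokens, synonyms):
    """Map each token to a canonical form using a toy synonym dictionary."""
    return [synonyms.get(tok, tok) for tok in tokens]


def rouge_n_recall(system, reference, n=1, synonyms=None):
    """Fraction of reference n-grams that also appear in the system summary.

    `synonyms` is a hypothetical {word: canonical_word} map; when given,
    both summaries are normalized before n-grams are counted.
    """
    sys_tokens = system.lower().split()
    ref_tokens = reference.lower().split()
    if synonyms:
        sys_tokens = normalize(sys_tokens, synonyms)
        ref_tokens = normalize(ref_tokens, synonyms)

    sys_counts = ngrams(sys_tokens, n)
    ref_counts = ngrams(ref_tokens, n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clipped overlap: a reference n-gram is matched at most as often
    # as it occurs in the system summary.
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


reference = "the movie was great"
system = "the film was great"
toy_synonyms = {"film": "movie"}  # illustrative only

plain = rouge_n_recall(system, reference, n=1)
with_syn = rouge_n_recall(system, reference, n=1, synonyms=toy_synonyms)
print(plain, with_syn)
```

In this toy case the plain unigram score penalizes "film" as a miss against "movie", while the synonym-normalized score treats them as a match, which is the kind of gap the ROUGE-N+Synonyms measure is described as addressing.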

