CL · Mar 31, 2021

A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods

arXiv:2104.00054v2667 citations
AI Analysis

This work addresses the need for more rigorous statistical analysis in summarization evaluation, providing tools to assess metric reliability, though it is incremental in applying existing resampling methods to this domain.

The paper addresses uncertainty in the evaluation of summarization metrics by proposing resampling-based methods for computing confidence intervals and running hypothesis tests on correlation estimates. It finds that the confidence intervals are wide, indicating high uncertainty, and that only QAEval and BERTScore show statistical improvements over ROUGE, and only in some settings.

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether a difference between two metrics' correlations reflects a true difference or is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
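
To make the two resampling ideas in the abstract concrete, here is a minimal sketch in Python. It is not the paper's implementation: it assumes one metric score and one human score per summary, resamples individual summaries only (the paper evaluates several resampling schemes and selects among them), and uses Pearson correlation. The function names `bootstrap_ci` and `permutation_test` and all parameters are hypothetical, chosen for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def bootstrap_ci(metric, human, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for corr(metric, human).

    Assumption: summaries are resampled with replacement as paired
    (metric, human) observations.
    """
    rng = np.random.default_rng(seed)
    metric = np.asarray(metric, dtype=float)
    human = np.asarray(human, dtype=float)
    n = len(metric)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        m, h = metric[idx], human[idx]
        if np.std(m) == 0 or np.std(h) == 0:
            continue  # correlation is undefined for a constant resample
        stats.append(pearson(m, h))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)


def permutation_test(metric_x, metric_y, human, n_permutations=1000, seed=0):
    """One-sided p-value for H0: corr(X, human) <= corr(Y, human).

    Assumption: each summary's X and Y scores are swapped with
    probability 0.5; both metrics are standardized first so that the
    swapped scores are on a comparable scale.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(metric_x, dtype=float)
    y = np.asarray(metric_y, dtype=float)
    human = np.asarray(human, dtype=float)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    observed = pearson(x, human) - pearson(y, human)
    count = 0
    for _ in range(n_permutations):
        swap = rng.random(len(x)) < 0.5
        x_perm = np.where(swap, y, x)
        y_perm = np.where(swap, x, y)
        delta = pearson(x_perm, human) - pearson(y_perm, human)
        if delta >= observed:
            count += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)
```

Under these assumptions, a wide interval from `bootstrap_ci` corresponds to the high uncertainty the abstract reports, and a small p-value from `permutation_test` corresponds to one metric showing a statistical improvement over another (for example, over ROUGE).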

Code Implementations: 1 repo