CL · AI · Feb 17, 2022

On the Evaluation Metrics for Paraphrase Generation

arXiv:2202.08479v2296 citations
AI Analysis

This paper addresses the challenge of accurate evaluation in natural language processing, which matters to both researchers and practitioners. Its contribution is incremental, since it builds on existing metric frameworks.

The paper analyzes automatic metrics for evaluating paraphrase generation and reports two findings: reference-free metrics outperform reference-based ones, and most commonly used metrics align poorly with human judgments. It then proposes ParaScore, a new metric that significantly outperforms existing ones.

In this paper we revisit automatic metrics for paraphrase evaluation and obtain two findings that disobey conventional wisdom: (1) Reference-free metrics achieve better performance than their reference-based counterparts. (2) Most commonly used metrics do not align well with human annotation. Underlying reasons behind the above findings are explored through additional experiments and in-depth analyses. Based on the experiments and analyses, we propose ParaScore, a new evaluation metric for paraphrase generation. It possesses the merits of reference-based and reference-free metrics and explicitly models lexical divergence. Experimental results demonstrate that ParaScore significantly outperforms existing metrics.
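To make the abstract's description more concrete, here is a minimal sketch of a metric with the same three ingredients: a reference-free similarity term, an optional reference-based similarity term, and an explicit reward for lexical divergence from the source. It is one illustrative reading of the abstract, not the paper's definition of ParaScore. The function names, the choice of `max` to combine the two similarity terms, the `weight` and `cap` values, and the token-overlap similarity (a toy stand-in for a learned semantic similarity model) are all assumptions made for this example.

```python
def tokens(text):
    """Lowercase whitespace tokenization (an assumption; real metrics use subword tokenizers)."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Toy stand-in for a semantic similarity model: token-set overlap in [0, 1]."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def lexical_divergence(source, candidate, cap=0.35):
    """Normalized edit distance from the source, capped so extreme rewrites earn no extra reward.

    The cap value is a hypothetical choice for illustration.
    """
    a, b = tokens(source), tokens(candidate)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return min(edit_distance(a, b) / longest, cap)


def combined_score(source, candidate, reference=None,
                   weight=0.05, cap=0.35, sim=jaccard_similarity):
    """Hypothetical combination: best available similarity plus a weighted divergence bonus."""
    sims = [sim(source, candidate)]              # reference-free term
    if reference is not None:
        sims.append(sim(reference, candidate))   # reference-based term
    return max(sims) + weight * lexical_divergence(source, candidate, cap)


if __name__ == "__main__":
    src = "the cat sat on the mat"
    cand = "a cat was sitting on the mat"
    ref = "a cat was sitting on the rug"
    print(combined_score(src, cand))        # reference-free use
    print(combined_score(src, cand, ref))   # with a reference available
```

Because the stand-in similarity is purely lexical, a verbatim copy of the source still scores highest here. With a real semantic similarity model plugged in through the `sim` parameter, the divergence term is what would separate a genuine paraphrase from a copy.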

Code Implementations: 1 repo
