CL Dec 19, 2022

SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes

Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, William Yang Wang

CMU

arXiv:2212.09305v221.7227 citations, h-index: 63, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for scalable, language-agnostic evaluation metrics in text generation and offers a practical option for researchers and practitioners. It is rated incremental because it builds on prior self-supervised methods.

The paper asks whether a general metric for text generation quality can be trained without human annotations. It proposes SESCORE2, a self-supervised approach that synthesizes realistic mistakes by perturbing sentences retrieved from a corpus. Evaluated on four text generation tasks across three languages, SESCORE2 outperforms the unsupervised metric PRISM on four benchmarks, with a Kendall improvement of 0.078, and also outperforms the supervised metrics BLEURT and COMET on multiple tasks.

Is it possible to train a general metric for evaluating text generation quality without human-annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on specific tasks. In this paper, we propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. The primary advantage of SESCORE2 is its ease of extension to many other languages while providing reliable severity estimation. We evaluate SESCORE2 and previous methods on four text generation tasks across three languages. SESCORE2 outperforms the unsupervised metric PRISM on four text generation evaluation benchmarks, with a Kendall improvement of 0.078. Surprisingly, SESCORE2 even outperforms the supervised BLEURT and COMET on multiple text generation tasks. The code and data are available at https://github.com/xu1998hz/SEScore2.
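To make the reported "Kendall improvement" concrete, here is a minimal sketch of how a Kendall rank correlation between metric scores and human ratings could be computed, and how the gap between two metrics would be read. It assumes the plain pairwise (tau-a) definition; the exact Kendall variant used in the paper may differ. The function name `kendall_tau` and the toy scores are hypothetical and are not taken from the paper or its repository.

```python
from itertools import combinations


def kendall_tau(metric_scores, human_ratings):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumption: plain tau-a, where tied pairs count toward neither side.
    """
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    pairs = list(combinations(range(len(metric_scores)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in pairs:
        # The sign of this product says whether the metric and the humans
        # order items i and j the same way (positive) or oppositely (negative).
        product = (metric_scores[i] - metric_scores[j]) * (
            human_ratings[i] - human_ratings[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Hypothetical toy data: four system outputs rated by humans and by two metrics.
human = [1.0, 2.0, 3.0, 4.0]
first_metric = [0.1, 0.4, 0.3, 0.9]   # swaps the order of items 2 and 3
second_metric = [0.1, 0.2, 0.3, 0.9]  # agrees with the human order everywhere

tau_first = kendall_tau(first_metric, human)
tau_second = kendall_tau(second_metric, human)
print(f"first metric tau:  {tau_first:.3f}")
print(f"second metric tau: {tau_second:.3f}")
print(f"improvement:       {tau_second - tau_first:.3f}")
```

In this reading, a "Kendall improvement of 0.078" would mean that the difference between the two correlations, computed against the same human ratings, is 0.078.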
