CL | Oct 6, 2020

GRUEN for Evaluating Linguistic Quality of Generated Text

arXiv:2010.02498v1 | 1000 citations
AI Analysis

This paper addresses the need for automated evaluation of linguistic quality in text generation, though the contribution is incremental because it builds on existing BERT-based methods.

The authors tackled the problem of evaluating the linguistic quality of generated text, which existing metrics had ignored, by proposing GRUEN, a reference-less metric that correlates highly with human judgments across seven datasets and four tasks.

Automatic evaluation metrics are indispensable for evaluating generated text. To date, these metrics have focused almost exclusively on the content selection aspect of the system output, ignoring the linguistic quality aspect altogether. We bridge this gap by proposing GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text. GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output. Unlike most existing evaluation metrics which require human references as an input, GRUEN is reference-less and requires only the system output. Besides, it has the advantage of being unsupervised, deterministic, and adaptable to various tasks. Experiments on seven datasets over four language generation tasks show that the proposed metric correlates highly with human judgments.
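The claim that a metric "correlates highly with human judgments" is typically checked by scoring a set of system outputs with the metric, collecting human ratings for the same outputs, and computing a correlation coefficient between the two lists. The following is a minimal sketch of that check using Spearman rank correlation, written with the standard library only. It is not the authors' code, the abstract does not say which coefficient they report, and the scores and ratings below are hypothetical values invented for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical numbers for illustration only; NOT results from the paper.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.15]  # one metric score per output
human_ratings = [5, 2, 4, 4, 1]                 # one human rating per output
rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

Rank correlation is used here because human ratings are often ordinal and tied; Pearson correlation on the raw values is the other common choice and is available as `pearson` in the same sketch.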

Code Implementations: 2 repos
