CL · Apr 14, 2020

A Human Evaluation of AMR-to-English Generation Systems

arXiv:2004.06814v2996 citations
AI Analysis

This work addresses the evaluation gap for natural language generation systems, offering insights for researchers and practitioners in computational linguistics, though it is incremental as it builds on existing evaluation methods.

The paper evaluates AMR-to-English generation systems through a human evaluation that assesses fluency, adequacy, and error types. It finds that automated metrics such as BLEU mostly rank systems correctly overall, but human judgments allow more nuanced comparisons.

Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to those of automatic metrics, finding that while the metrics are mostly successful in ranking systems overall, collecting human judgments allows for more nuanced comparisons. We also analyze common errors made by these systems.

