CL · May 24, 2022

A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation -- through the Lens of Semantic Similarity Rating

arXiv:2205.12176v1628 citations · h-index: 33
Originality: Incremental advance
AI Analysis

This work addresses meaning-oriented evaluation of NLG metrics, particularly for AMR-to-text generation. This matters to researchers building more accurate text generation systems, although the contribution appears incremental because it builds on the existing CheckList framework.

The paper addresses the problem that natural language generation (NLG) metrics often reward surface form rather than meaning. It develops a dynamic CheckList organized around meaning-relevant linguistic phenomena, in which each test instance pairs two sentences with their AMR graphs and a human semantic similarity or relatedness score. It demonstrates the CheckList's usefulness with a new metric, GraCo, which the authors' analysis suggests is worth further investigation.

Evaluating the quality of generated text is difficult, since traditional NLG evaluation metrics, focusing more on surface form than meaning, often fail to assign appropriate scores. This is especially problematic for AMR-to-text evaluation, given the abstract nature of AMR. Our work aims to support the development and improvement of NLG evaluation metrics that focus on meaning, by developing a dynamic CheckList for NLG metrics that is interpreted by being organized around meaning-relevant linguistic phenomena. Each test instance consists of a pair of sentences with their AMR graphs and a human-produced textual semantic similarity or relatedness score. Our CheckList facilitates comparative evaluation of metrics and reveals strengths and weaknesses of novel and traditional metrics. We demonstrate the usefulness of CheckList by designing a new metric GraCo that computes lexical cohesion graphs over AMR concepts. Our analysis suggests that GraCo presents an interesting NLG metric worth future investigation and that meaning-oriented NLG metrics can profit from graph-based metric components using AMR.
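The test-instance structure and the per-phenomenon comparison described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `TestInstance`, `jaccard_metric`, and `evaluate_by_phenomenon`, the use of Pearson correlation, the score scale, and the toy sentences and scores are all hypothetical. The token-overlap metric is a stand-in for a surface-form metric and is not GraCo.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, List


@dataclass
class TestInstance:
    """One CheckList item (hypothetical layout, assumed for illustration)."""
    phenomenon: str      # meaning-relevant linguistic phenomenon, e.g. "negation"
    sentence_a: str
    sentence_b: str
    amr_a: str           # AMR graph of sentence_a, kept as a string here
    amr_b: str           # AMR graph of sentence_b
    human_score: float   # human similarity/relatedness rating (scale assumed)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def jaccard_metric(inst: TestInstance) -> float:
    """Toy surface-form metric: token-set overlap of the two sentences."""
    a = set(inst.sentence_a.lower().split())
    b = set(inst.sentence_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def evaluate_by_phenomenon(
    instances: List[TestInstance],
    metric: Callable[[TestInstance], float],
) -> Dict[str, float]:
    """Correlate metric scores with human scores separately per phenomenon."""
    grouped: Dict[str, List[TestInstance]] = defaultdict(list)
    for inst in instances:
        grouped[inst.phenomenon].append(inst)
    return {
        phenomenon: pearson(
            [metric(i) for i in group],
            [i.human_score for i in group],
        )
        for phenomenon, group in grouped.items()
    }


if __name__ == "__main__":
    # Made-up toy items; AMR strings are left empty because the toy metric ignores them.
    toy = [
        TestInstance("negation", "the cat sleeps", "the cat sleeps", "", "", 5.0),
        TestInstance("negation", "the cat sleeps", "the cat does not sleep", "", "", 1.5),
        TestInstance("negation", "the cat sleeps", "a dog runs", "", "", 0.5),
    ]
    print(evaluate_by_phenomenon(toy, jaccard_metric))
```

Reporting one correlation per phenomenon, rather than a single global number, is what lets such a CheckList show where a given metric is strong or weak. A graph-based metric would read `amr_a` and `amr_b` instead of the raw sentences.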

