CL · Sep 16, 2019

Communication-based Evaluation for Natural Language Generation

arXiv:1909.07290v2999 citations
AI Analysis

The paper addresses the problem of evaluating NLG systems more effectively, which matters to both researchers and practitioners. Its contribution is incremental, since it builds on existing pragmatic models.

The paper argues that n-gram overlap measures such as BLEU and ROUGE are misaligned with the true goals of natural language generation. It proposes communication-based evaluation using the Rational Speech Acts model and shows, on a color reference dataset, that this method aligns better with pre-defined quality categories than the n-gram overlap methods do.

Natural language generation (NLG) systems are commonly evaluated using n-gram overlap measures (e.g. BLEU, ROUGE). These measures do not directly capture semantics or speaker intentions, and so they often turn out to be misaligned with our true goals for NLG. In this work, we argue instead for communication-based evaluations: assuming the purpose of an NLG system is to convey information to a reader/listener, we can directly evaluate its effectiveness at this task using the Rational Speech Acts model of pragmatic language use. We illustrate with a color reference dataset that contains descriptions in pre-defined quality categories, showing that our method better aligns with these quality categories than do any of the prominent n-gram overlap methods.
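To make the idea of a communication-based score concrete, here is a minimal sketch of the standard Rational Speech Acts recursion on a toy color reference task. It is not the paper's implementation: the three colors, the hand-written lexicon values, the uniform prior over targets, the absence of utterance costs, and the function names (`literal_listener`, `pragmatic_speaker`, `pragmatic_listener`, `communication_score`) are all assumptions made for illustration. The score of a description is taken to be the probability that a pragmatic listener picks the intended color after hearing it.

```python
# Hypothetical toy context: three candidate colors the listener must choose from.
TARGETS = ["blue", "teal", "green"]

# Hypothetical lexicon: how well each utterance literally applies to each color
# (values in [0, 1], invented for illustration).
LEXICON = {
    "blue": {"blue": 1.0, "teal": 0.5, "green": 0.0},
    "green": {"blue": 0.0, "teal": 0.5, "green": 1.0},
    "bluish green": {"blue": 0.1, "teal": 1.0, "green": 0.3},
    "color": {"blue": 1.0, "teal": 1.0, "green": 1.0},
}


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance):
    """L0(t | u): proportional to the lexicon value, with a uniform prior over targets."""
    return normalize({t: LEXICON[utterance][t] for t in TARGETS})


def pragmatic_speaker(target, alpha=1.0):
    """S1(u | t): proportional to L0(t | u) ** alpha, with no utterance cost (assumption)."""
    return normalize({u: literal_listener(u)[target] ** alpha for u in LEXICON})


def pragmatic_listener(utterance, alpha=1.0):
    """L1(t | u): proportional to S1(u | t), with a uniform prior over targets."""
    return normalize({t: pragmatic_speaker(t, alpha)[utterance] for t in TARGETS})


def communication_score(utterance, target, alpha=1.0):
    """Probability that the pragmatic listener recovers the intended target."""
    return pragmatic_listener(utterance, alpha)[target]


if __name__ == "__main__":
    for utterance in LEXICON:
        print(utterance, round(communication_score(utterance, "blue"), 3))
```

In this toy setup, a specific description such as "blue" scores higher for the blue target than the uninformative "color", even though both are literally true of it. That difference in communicative success is what an n-gram overlap measure does not directly capture.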

Code Implementations: 1 repo