CL · Jul 21, 2017

Why We Need New Evaluation Metrics for NLG

arXiv:1707.06875v1 · 1279 citations
Originality: Synthesis-oriented
AI Analysis

This work highlights a critical problem for NLG researchers and practitioners by showing the limitations of existing evaluation metrics. The contribution is incremental in that it builds on prior critiques.

The paper demonstrates that current automatic metrics for natural language generation (NLG) only weakly reflect human judgments and that metric performance is data- and system-specific. The metrics nevertheless remain reliable at the system level, where they can identify cases in which a system performs poorly.

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

Code Implementations · 1 repo