CL, Jun 23, 2015

deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

arXiv:1506.06863v2, 159 citations
Originality: Incremental advance
AI Analysis

deltaBLEU offers a more accurate evaluation metric for natural language generation tasks with diverse valid outputs, such as conversational AI. The advance is incremental, since it builds on existing BLEU methods.

The paper addresses the problem of evaluating generated text in tasks that admit many acceptable outputs. It introduces deltaBLEU, a discriminative metric that weights multi-reference BLEU by human-rated quality scores. In conversational response generation, the authors find that deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU.

We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's rho and Kendall's tau.
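
To make the weighting idea concrete, here is a minimal Python sketch of a rating-weighted n-gram precision for a single hypothesis. It is an illustration under stated assumptions, not the paper's reference implementation: the function names `ngrams` and `weighted_precision` are hypothetical, each matched n-gram is assumed to be credited with the best `weight * clipped count` over the references that contain it, and the denominator is assumed to use the highest reference weight. The brevity penalty and the geometric mean over n-gram orders used in standard BLEU are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def weighted_precision(hypothesis, references, n):
    """Rating-weighted n-gram precision for one hypothesis (illustrative only).

    references is a list of (tokens, weight) pairs, where weight is a
    human quality rating in [-1, +1].
    """
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = [(ngrams(tokens, n), weight) for tokens, weight in references]
    max_weight = max(weight for _, weight in references)

    numerator = 0.0
    denominator = 0.0
    for gram, count in hyp_counts.items():
        # Assumption: credit the n-gram with the best weighted clipped count
        # among the references that actually contain it.
        matches = [w * min(count, rc[gram]) for rc, w in ref_counts if gram in rc]
        if matches:
            numerator += max(matches)
        # Assumption: normalise by the highest reference weight.
        denominator += max_weight * count

    # Guard against empty hypotheses and non-positive normalisers.
    return numerator / denominator if denominator > 0 else 0.0


hyp = "i am fine thanks".split()
refs = [("i am fine".split(), 1.0), ("no thanks".split(), -0.5)]
print(weighted_precision(hyp, refs, 1))  # 0.625
```

In this toy example the unigram "thanks" appears only in the negatively rated reference, so it lowers the score instead of raising it. With every weight set to 1.0 the same hypothesis would score 1.0, which is how the weighting discriminates between good and poor references.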

