CL · Jun 23, 2015

deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, Bill Dolan

arXiv:1506.06863v2 · 24.7 · 159 citations

Originality: Incremental advance

AI Analysis

The paper provides a more accurate evaluation metric for natural language generation tasks with diverse valid targets, such as conversational response generation. The advance is incremental because it builds on existing BLEU methods.

The paper tackles the problem of evaluating generated text in tasks that admit many acceptable outputs. It introduces deltaBLEU, a discriminative metric that weights multi-reference BLEU by human-rated quality scores for the reference strings. On conversational response generation, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU.

We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's rho and Kendall's tau.
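To make the weighting idea concrete, here is a minimal Python sketch of a rating-weighted n-gram precision for a single hypothesis. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formula, so the choice to take the best weighted clipped match per n-gram in the numerator and to normalize by the highest reference weight in the denominator is an assumption, and the function names `ngram_counts` and `weighted_precision` are hypothetical. The brevity penalty and the combination of several n-gram orders used in full BLEU are left out.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def weighted_precision(hypothesis, references, n):
    """Rating-weighted n-gram precision for one hypothesis (illustrative only).

    hypothesis: list of tokens.
    references: list of (tokens, weight) pairs, weight in [-1, +1].
    Assumption: each hypothesis n-gram is credited with its best weighted,
    clipped match among the references that contain it.
    """
    hyp = ngram_counts(hypothesis, n)
    refs = [(ngram_counts(tokens, n), weight) for tokens, weight in references]
    if not hyp or not refs:
        return 0.0

    # Assumption: normalize by the best achievable score, i.e. every
    # hypothesis n-gram matched in the highest-rated reference.
    max_weight = max(weight for _, weight in refs)
    denominator = max_weight * sum(hyp.values())
    if denominator <= 0:
        return 0.0

    numerator = 0.0
    for gram, hyp_count in hyp.items():
        matches = [
            weight * min(hyp_count, counts[gram])
            for counts, weight in refs
            if counts[gram] > 0
        ]
        if matches:
            numerator += max(matches)
    return numerator / denominator


references = [
    ("i am fine thanks".split(), 1.0),   # reference rated as a good response
    ("no idea".split(), -0.5),           # reference rated as a poor response
]
good = weighted_precision("i am fine".split(), references, 1)
poor = weighted_precision("no idea".split(), references, 1)
```

In this toy setup, a hypothesis that overlaps only with the poorly rated reference receives a negative score, which is the discriminative behavior the abstract describes: matching a low-quality reference is penalized rather than rewarded.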
