CL, May 13, 2022

Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

Microsoft, Salesforce
arXiv:2205.06871v2292 citations, h-index: 38
Originality: Incremental advance
AI Analysis

Near-Negative Distinction (NND) gives NLG researchers and practitioners a low-cost, reusable evaluation method. The advance is incremental: it builds on existing human annotations rather than introducing a fundamentally new paradigm.

The paper tackles the cost and non-reusability of human evaluation in natural language generation (NLG). It proposes Near-Negative Distinction (NND), an automatic evaluation method that repurposes human annotations into tests in which a model must distinguish a high-quality output from a near-negative containing a known error. In experiments on three NLG tasks, NND correlates more strongly with human judgments than standard metrics.

Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
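
The test procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `NNDTest` fields, the `run_nnd_tests` function, the error-type labels, and the toy scores are all hypothetical names and values chosen for illustration. A real evaluation would replace `toy_log_likelihood` with the log-likelihood that the NLG model under test assigns to a candidate given its input.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class NNDTest:
    """One NND test: a high-quality candidate paired with a near-negative."""
    context: str        # model input (e.g. a paragraph for question generation)
    positive: str       # candidate judged high-quality by annotators
    near_negative: str  # candidate annotated with a known error
    error_type: str     # hypothetical label for that error


def run_nnd_tests(
    tests: List[NNDTest],
    log_likelihood: Callable[[str, str], float],
) -> Tuple[float, Dict[str, int]]:
    """Return the pass rate and the count of failed tests per error type."""
    passed = 0
    failures: Counter = Counter()
    for t in tests:
        # A test passes only if the positive is strictly more likely.
        if log_likelihood(t.context, t.positive) > log_likelihood(t.context, t.near_negative):
            passed += 1
        else:
            failures[t.error_type] += 1
    pass_rate = passed / len(tests) if tests else 0.0
    return pass_rate, dict(failures)


# Made-up scores standing in for a real model's log-likelihoods (assumption).
TOY_SCORES = {
    "Who wrote Hamlet?": -2.1,
    "Who wrote Shakespeare?": -3.4,
    "What is the capital of France?": -1.8,
    "What is capital France of?": -1.5,
    "At what temperature does water boil?": -2.5,
    "At what temperature does water freeze?": -4.0,
}


def toy_log_likelihood(context: str, candidate: str) -> float:
    """Look up a fixed score; a real scorer would condition on the context."""
    return TOY_SCORES[candidate]


tests = [
    NNDTest("Hamlet was written by Shakespeare.",
            "Who wrote Hamlet?", "Who wrote Shakespeare?", "off_target"),
    NNDTest("Paris is the capital of France.",
            "What is the capital of France?", "What is capital France of?", "disfluent"),
    NNDTest("Water boils at 100 degrees Celsius.",
            "At what temperature does water boil?",
            "At what temperature does water freeze?", "wrong_context"),
]

pass_rate, failures = run_nnd_tests(tests, toy_log_likelihood)
```

With these toy scores the model passes two of the three tests and fails only the test whose near-negative is labeled `disfluent`. The per-error failure counts are what would support the kind of error-specific analysis the abstract mentions. Treating ties as failures is a design choice of this sketch, following the abstract's wording that the model must place a higher likelihood on the high-quality candidate.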

Code Implementations: 1 repo
