CL, Mar 15, 2018

RankME: Reliable Human Ratings for Natural Language Generation

arXiv:1803.05928v1, 122 citations
Originality: Incremental advance
AI Analysis

This paper addresses unreliable human evaluation of NLG systems, a problem that matters to researchers and practitioners in natural language processing, though its contribution is an incremental improvement in experimental design.

The paper tackles inconsistent human ratings in natural language generation (NLG) evaluation by introducing RankME, a rank-based magnitude estimation method that combines continuous scales with relative assessments. The authors report significantly improved reliability and consistency of ratings compared to traditional evaluation methods.

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.
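
To make the idea of combining continuous scales with relative assessments more concrete, here is a minimal sketch, not the authors' implementation. It assumes each rater scores several system outputs side by side on an unbounded positive scale, and that scores are made comparable by dividing by each rater's highest score. The function names, the normalisation choice, and the example ratings are all hypothetical and do not come from the paper.

```python
from statistics import mean


def normalise(scores):
    """Rescale one rater's scores so that the rater's top score becomes 1.0.

    Assumption: scores are positive magnitudes on a rater-specific scale.
    """
    top = max(scores.values())
    return {system: value / top for system, value in scores.items()}


def aggregate(ratings):
    """Average the normalised scores of every rater for each system."""
    normalised = [normalise(scores) for scores in ratings.values()]
    systems = normalised[0].keys()
    return {s: mean(n[s] for n in normalised) for s in systems}


def rank(scores):
    """Turn scores into ranks, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {system: position for position, system in enumerate(ordered, start=1)}


# Hypothetical ratings: two raters using very different numeric ranges.
ratings = {
    "rater_a": {"sys1": 100.0, "sys2": 50.0, "sys3": 25.0},
    "rater_b": {"sys1": 10.0, "sys2": 8.0, "sys3": 2.0},
}

summary = aggregate(ratings)
ranking = rank(summary)
print(summary)
print(ranking)
```

The two hypothetical raters use different numeric ranges, yet after normalisation their relative judgements can be averaged and turned into a single ranking. The Bayesian estimation of system quality mentioned in the abstract is a separate step and is not covered by this sketch.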

Code Implementations: 1 repo