RankME: Reliable Human Ratings for Natural Language Generation
This paper addresses the unreliability of human evaluations for NLG systems, an issue that matters to researchers and practitioners in natural language processing, though the contribution is an incremental improvement in experimental design.
The paper tackles the problem of inconsistent human ratings in natural language generation (NLG) evaluation by introducing RankME, a rank-based magnitude estimation method that combines continuous scales with relative assessments. The authors report that it significantly improves the reliability and consistency of ratings compared to traditional evaluation methods.
Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.
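To make the idea of combining continuous scales with relative assessments more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each rater scores several system outputs for the same input on an open-ended scale, relative to a reference output fixed at a hypothetical value of 100. The names `REFERENCE_SCORE`, `normalise`, `to_ranks`, and `mean_ratio`, as well as the example ratings, are illustrative assumptions that do not come from the paper.

```python
from statistics import mean

# Hypothetical value assigned to the reference output (an assumption).
REFERENCE_SCORE = 100.0


def normalise(scores, reference=REFERENCE_SCORE):
    """Express each magnitude score as a ratio to the reference value."""
    return {system: score / reference for system, score in scores.items()}


def to_ranks(scores):
    """Derive a relative ranking from magnitude scores (1 = best).

    Tied systems share the same rank: a system's rank is one plus the
    number of systems that received a strictly higher score.
    """
    return {
        system: 1 + sum(other > score for other in scores.values())
        for system, score in scores.items()
    }


def mean_ratio(ratings):
    """Average the normalised score of each system across all raters."""
    systems = ratings[0].keys()
    return {
        system: mean(normalise(r)[system] for r in ratings)
        for system in systems
    }


# Made-up ratings from two raters for three systems (illustration only).
ratings = [
    {"A": 150, "B": 100, "C": 50},
    {"A": 120, "B": 120, "C": 60},
]

ranks_per_rater = [to_ranks(r) for r in ratings]
average_ratios = mean_ratio(ratings)
```

The same set of scores yields both a continuous magnitude (the ratio to the reference) and a relative ordering of the systems, which is the combination the abstract describes. How the paper itself normalises scores or handles ties is not stated here, so both choices in the sketch are assumptions.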