CL AI LG | May 21, 2020

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

arXiv:2005.10716v231.41020 citations | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of reliable automatic evaluation for open-domain dialog systems, particularly social conversational systems such as Amazon Alexa Prize chatbots. It is an incremental advance, since it builds on existing user rating methods.

The paper tackles bias and variance in self-reported user ratings for dialog system evaluation by recasting evaluation as a comparison task. It proposes CMADE, an automatic model that cleans these ratings as it trains on them and reaches 89.2% accuracy on the dialog comparison task.
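As a rough illustration of what "formulating evaluation as a comparison task" can look like, the sketch below turns Likert-rated dialogs into labeled pairs. The function name `make_comparison_pairs`, the `min_gap` threshold, and the toy data are all hypothetical; the summary does not say how the paper samples or filters pairs.

```python
from itertools import combinations

def make_comparison_pairs(dialogs, min_gap=1):
    """Turn (dialog_id, likert_rating) items into (first, second, label) pairs.

    label is 1 when the first dialog is rated higher, else 0.
    Pairs whose ratings differ by less than `min_gap` are skipped,
    an assumed heuristic for avoiding near-tie comparisons.
    """
    pairs = []
    for (a, rating_a), (b, rating_b) in combinations(dialogs, 2):
        if abs(rating_a - rating_b) < min_gap:
            continue  # too close to call; skip this pair
        pairs.append((a, b, 1 if rating_a > rating_b else 0))
    return pairs

# Toy ratings on a 1-5 Likert scale (made up for illustration).
rated = [("dialog_a", 5), ("dialog_b", 2), ("dialog_c", 4), ("dialog_d", 4)]
pairs = make_comparison_pairs(rated, min_gap=2)
```

A comparison model would then be trained to predict the label for each pair, which sidesteps the question of whether one user's "4" means the same as another user's "4".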

Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance among different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn a better dialog feature representation, and then use KNN and Shapley to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
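The abstract does not spell out how KNN and Shapley are combined. One plausible reading is the closed-form Shapley value under a K-nearest-neighbor utility known from the data valuation literature, where low-valued training samples are treated as suspect. The sketch below assumes that reading; `knn_shapley`, the 2-D toy features, and `k=3` are illustrative choices, not the paper's code.

```python
import math

def knn_shapley(train_x, train_y, val_x, val_y, k=3):
    """Average KNN-Shapley value of each training sample over a validation set."""
    n = len(train_x)
    values = [0.0] * n
    for vx, vy in zip(val_x, val_y):
        # Rank training samples from nearest to farthest from the validation point.
        order = sorted(range(n), key=lambda i: math.dist(train_x[i], vx))
        match = [1.0 if train_y[i] == vy else 0.0 for i in order]
        s = [0.0] * n
        s[n - 1] = match[n - 1] / n  # farthest sample
        # Recurse from the second-farthest sample back to the nearest one.
        for pos in range(n - 2, -1, -1):
            rank = pos + 1  # 1-based rank of the sample at `pos`
            s[pos] = s[pos + 1] + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        for pos, i in enumerate(order):
            values[i] += s[pos] / len(val_x)
    return values

# Toy features: two clusters, plus one sample (index 6) that sits in
# cluster 0 but carries label 1, standing in for a "confusing" rating.
train_x = [(0, 0), (0.1, 0), (0, 0.1), (1, 1), (1.1, 1), (1, 1.1), (0.05, 0.05)]
train_y = [0, 0, 0, 1, 1, 1, 1]
val_x = [(0.02, 0.02), (1.02, 1.02)]
val_y = [0, 1]

values = knn_shapley(train_x, train_y, val_x, val_y, k=3)
suspect = min(range(len(values)), key=values.__getitem__)  # lowest-valued sample
```

In this toy run the mislabeled sample receives the lowest value, so a cleaning step could drop or down-weight it before training the comparison model. How many samples to remove, and on which features the distances are computed, are design choices the abstract leaves open.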
