CL, AI, LG | May 21, 2020

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

arXiv:2005.10716v2 | 1020 citations
AI Analysis

This paper addresses the challenge of reliable automatic evaluation for open-domain dialog systems, particularly social conversational systems such as Amazon Alexa Prize chatbots. The contribution is incremental, since it builds on existing user rating methods.

The paper tackles bias and variance in self-reported user ratings for dialog system evaluation by reformulating evaluation as a comparison task. It proposes CMADE, an automatic model that cleans these ratings as it trains on them and achieves 89.2% accuracy on the dialog comparison task.
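The comparison framing can be illustrated with a minimal sketch that turns per-dialog Likert ratings into pairwise labels. The function names, the `min_gap` parameter, and the pairing strategy (all pairs with differing ratings) are assumptions made for illustration; the summary does not say how the paper actually builds its pairs.

```python
from itertools import combinations


def make_comparison_pairs(dialogs, min_gap=1):
    """Build pairwise comparison examples from (dialog_id, rating) tuples.

    Hypothetical helper: a pair is kept only if the two ratings differ by at
    least `min_gap` (use min_gap >= 1 so that ties are skipped).
    The label is 1 if the first dialog was rated higher, else 0.
    """
    pairs = []
    for (id_a, r_a), (id_b, r_b) in combinations(dialogs, 2):
        if abs(r_a - r_b) < min_gap:
            continue  # ties or near-ties carry no comparison signal
        pairs.append((id_a, id_b, 1 if r_a > r_b else 0))
    return pairs


def pairwise_accuracy(pairs, predict):
    """Fraction of pairs where predict(id_a, id_b) matches the rating-derived label."""
    if not pairs:
        raise ValueError("no comparison pairs to evaluate")
    correct = sum(1 for a, b, label in pairs if predict(a, b) == label)
    return correct / len(pairs)
```

Under this framing, a reported accuracy is the share of dialog pairs for which the model picks the same winner as the user ratings, rather than agreement with an absolute 1-to-5 score.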

Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user rating is widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user rating suffers from bias and variance among different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn better dialog feature representation, and then use KNN and Shapley to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
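The "KNN and Shapley" cleaning step can be pictured with a rough sketch of KNN-based Shapley data valuation. It assumes the known closed-form recursion for an unweighted KNN classifier and Euclidean distance over feature vectors. The abstract does not specify which Shapley computation, distance, or removal threshold the authors use, so `knn_shapley` and its parameters are illustrative only.

```python
import math


def knn_shapley(train_x, train_y, test_x, test_y, k=2):
    """Shapley value of each training point for an unweighted KNN classifier.

    Assumed recursion (closed form for KNN utility), applied per test point
    with training points sorted from nearest to farthest:
      s[last] = match[last] / n
      s[j]    = s[j+1] + (match[j] - match[j+1]) / k * min(k, rank) / rank
    where rank = j + 1 and match is 1 if the training label equals the test label.
    """
    n = len(train_x)
    values = [0.0] * n
    for x_t, y_t in zip(test_x, test_y):
        # Training indices ordered by distance to this test point.
        order = sorted(range(n), key=lambda i: math.dist(train_x[i], x_t))
        match = [1.0 if train_y[i] == y_t else 0.0 for i in order]
        s = [0.0] * n
        s[n - 1] = match[n - 1] / n
        for j in range(n - 2, -1, -1):
            rank = j + 1
            s[j] = s[j + 1] + (match[j] - match[j + 1]) / k * min(k, rank) / rank
        # Average the per-test-point values over the test set.
        for j, i in enumerate(order):
            values[i] += s[j] / len(test_x)
    return values


# Toy data: the last training point sits in the class-0 cluster but is labeled 1.
train_x = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (0.15,)]
train_y = [0, 0, 0, 1, 1, 1]
values = knn_shapley(train_x, train_y, [(0.04,), (0.13,)], [0, 0], k=2)
suspect = min(range(len(values)), key=values.__getitem__)  # lowest-valued sample
```

In this toy setup the mislabeled point receives the lowest (negative) value, which is the kind of signal that could be used to drop confusing samples before retraining.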

Code Implementations: 1 repo