CL · Sep 23, 2019

Towards Best Experiment Design for Evaluating Dialogue System Output

arXiv:1909.10122v1 · 1003 citations
AI Analysis

This work addresses the inconsistency of human evaluations of dialogue system output, offering incremental improvements for researchers and practitioners in natural language processing.

The study investigated how different experiment designs affect the consistency of human ratings of dialogue system output. It found that continuous scales yield more consistent ratings than Likert scales or ranking-based designs, and that longer task completion time and raters' lack of prior experience with similar studies both positively affect agreement.

To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from inconsistent ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of human judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert scale or ranking-based experiment designs. Additionally, we find that factors such as the time taken to complete the task and having no prior experience of participating in similar studies of rating dialogue system output positively impact consistency and agreement amongst raters.
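
The abstract does not say how Best-Worst judgments were turned into scores. One common counting procedure, shown here as a minimal sketch and not as the authors' implementation, scores each item as the share of times it was picked best minus the share of times it was picked worst. The function name `best_worst_scores`, the tuple layout, and the toy response IDs are all hypothetical.

```python
from collections import Counter


def best_worst_scores(annotations):
    """Score items from Best-Worst judgments by simple counting.

    `annotations` is an iterable of (items, best, worst) tuples, where
    `items` is the set of responses shown together and `best` / `worst`
    are the rater's picks. This layout is an assumption for illustration.
    Returns a dict mapping each item to a score in [-1, 1].
    """
    best = Counter()
    worst = Counter()
    seen = Counter()
    for items, picked_best, picked_worst in annotations:
        if picked_best not in items or picked_worst not in items:
            raise ValueError("best and worst picks must be among the shown items")
        seen.update(items)          # how often each item was shown
        best[picked_best] += 1      # how often it was chosen as best
        worst[picked_worst] += 1    # how often it was chosen as worst
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical toy data: three raters judge the same four responses.
annotations = [
    (("r1", "r2", "r3", "r4"), "r1", "r4"),
    (("r1", "r2", "r3", "r4"), "r2", "r4"),
    (("r1", "r2", "r3", "r4"), "r1", "r3"),
]
scores = best_worst_scores(annotations)
```

Scores near 1 mark responses that raters usually picked as best, and scores near -1 mark those usually picked as worst. Consistency across raters would still have to be measured separately, for example by comparing the scores obtained from two disjoint halves of the rater pool.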
