CL · AI · LG · NE · Mar 25, 2016

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

arXiv:1603.08023v2 · 1388 citations
AI Analysis

This work addresses the problem of unreliable automatic evaluation for dialogue systems, a concern for both researchers and developers. Its contribution is incremental: it highlights specific weaknesses in existing metrics.

The study examined unsupervised evaluation metrics for dialogue response generation and found that metrics borrowed from machine translation correlate poorly with human judgments: very weakly on the non-technical Twitter data and not at all on the technical Ubuntu data.

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
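The kind of check the abstract describes can be sketched as follows: score each generated response against its single target response with a word-overlap metric, then measure how well those scores track human ratings. This is a minimal illustration under stated assumptions, not the authors' code. The metric is a clipped unigram precision (the n=1 building block of BLEU, without the brevity penalty), Spearman rank correlation is used as one common choice of coefficient, and the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of a candidate against one reference."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical toy data: (target response, generated response, human rating 1-5).
    samples = [
        ("try restarting the network service", "restart the network service", 5),
        ("try restarting the network service", "have you tried rebooting", 4),
        ("try restarting the network service", "the the the service", 1),
        ("try restarting the network service", "i do not know", 2),
    ]
    metric_scores = [unigram_precision(gen, ref) for ref, gen, _ in samples]
    human_scores = [rating for _, _, rating in samples]
    print("metric scores:", metric_scores)
    print("Spearman correlation:", spearman(metric_scores, human_scores))
```

The second toy sample shows why a single-reference overlap metric can mislead in dialogue: a reasonable response that shares no words with the target scores zero, whatever a human rater thinks of it.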

Code Implementations: 2 repos
