AI · CL · LG · Sep 26, 2019

Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

arXiv:1909.12066v21003 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable evaluation metrics in conversational AI. The advance is incremental, and the attempts to use the metric for system improvement meet with mixed success.

The authors address automated evaluation of conversational dialogue systems by introducing AutoJudge, a model trained on human ratings of self-generated (self-talk) dialogues. AutoJudge correlates well with human ratings and is effective for re-ranking candidate utterances, but it fails as a reward for reinforcement learning.

We present "AutoJudge", an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e. a dialogue system talking to itself. Then, it uses human ratings on these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even in deployed systems. In a second part, we attempt to apply AutoJudge to improve existing systems. This works well for re-ranking a set of candidate utterances. However, our experiments show that AutoJudge cannot be applied as a reward for reinforcement learning, although the metric can distinguish good from bad dialogues. We discuss potential reasons, but state here already that this is still an open question for further research.
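The self-talk and re-ranking steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Responder` and `Judge` interfaces, the function names `self_talk` and `rerank`, and the toy stand-ins `echo_system` and `toy_judge` are all assumptions made for illustration. In the paper, the judge would be a model trained on human ratings of the self-talk dialogues.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces; the abstract does not specify any API.
Responder = Callable[[Sequence[str]], str]  # dialogue context -> next utterance
Judge = Callable[[Sequence[str]], float]    # dialogue -> quality score


def self_talk(system: Responder, seed: str, turns: int) -> List[str]:
    """Let a single dialogue system converse with itself for `turns` turns."""
    dialogue = [seed]
    for _ in range(turns):
        dialogue.append(system(dialogue))
    return dialogue


def rerank(context: Sequence[str], candidates: Sequence[str], judge: Judge) -> List[str]:
    """Order candidates by the judge's score of context + candidate, best first."""
    return sorted(candidates, key=lambda c: judge(list(context) + [c]), reverse=True)


# Toy stand-ins (assumptions): a system that echoes the last utterance, and a
# "judge" that counts distinct words in the last utterance.
def echo_system(context: Sequence[str]) -> str:
    return f"reply to: {context[-1]}"


def toy_judge(dialogue: Sequence[str]) -> float:
    return float(len(set(dialogue[-1].split())))


if __name__ == "__main__":
    print(self_talk(echo_system, "hello", 2))
    print(rerank(["hi"], ["ok", "how are you today", "fine thanks"], toy_judge))
```

With a trained judgement model in place of `toy_judge`, the same `rerank` call would select among candidate utterances produced by an existing system, which is the use the abstract reports as working well.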

