CL AI HC Jun 23, 2020

Unsupervised Evaluation of Interactive Dialog with DialoGPT

arXiv:2006.12719v1, 1034 citations
Originality: Incremental advance
AI Analysis

This work provides a more interpretable automatic evaluation method for dialog research, addressing a known bottleneck in the field.

The paper tackles the evaluation of open-domain dialog systems by introducing FED, an unsupervised metric that uses DialoGPT to measure fine-grained dialog qualities without ground-truth responses or training data. FED achieves moderate to strong correlation with human judgment.

It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
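The abstract does not spell out how a score is computed without a reference response. The sketch below assumes a follow-up-likelihood formulation: a dialog context is scored by how much more likely a language model finds positive follow-up utterances than negative ones. The function names, the follow-up phrases, and the toy scorer are all hypothetical and are not taken from the paper's code. In practice, `log_likelihood` would wrap a pretrained model such as DialoGPT.

```python
from typing import Callable, Sequence


def followup_score(
    context: str,
    positive_followups: Sequence[str],
    negative_followups: Sequence[str],
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Score a context by comparing follow-up likelihoods.

    Returns the summed log-likelihood of the positive follow-ups minus
    the summed log-likelihood of the negative follow-ups. No ground-truth
    response and no training data are involved.
    """
    pos = sum(log_likelihood(context, u) for u in positive_followups)
    neg = sum(log_likelihood(context, u) for u in negative_followups)
    return pos - neg


# Hypothetical follow-up utterances for one quality ("interesting").
POSITIVE = ["Wow, that is really interesting!"]
NEGATIVE = ["That is really boring."]


def toy_log_likelihood(context: str, followup: str) -> float:
    """Stand-in for a language model (an assumption for illustration only).

    Pretends that an upbeat follow-up is likely after a context containing
    an exclamation mark, and unlikely otherwise.
    """
    engaged = "!" in context
    upbeat = followup in POSITIVE
    return -1.0 if engaged == upbeat else -3.0


lively = followup_score("I just got back from Iceland!", POSITIVE, NEGATIVE, toy_log_likelihood)
dull = followup_score("I do not know.", POSITIVE, NEGATIVE, toy_log_likelihood)
print(lively, dull)
```

Passing the scorer in as a function keeps the sketch runnable without downloading a model. Swapping in a real language model would change only the `log_likelihood` argument.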

Code Implementations: 2 repos