CL · LG · May 1, 2020

Learning an Unreferenced Metric for Online Dialogue Evaluation

arXiv:2005.00583v1 · 1026 citations
AI Analysis

This paper addresses the challenge of scalable, generalizable dialogue evaluation for AI systems, though the contribution is incremental because it builds on existing metric approaches.

The paper tackles the problem of automatically evaluating dialogue quality without human-generated reference responses. It proposes an unreferenced metric that uses pre-trained language models and the temporal transitions between utterances, and it reports higher correlation with human annotations in online settings.
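"Correlation with human annotations" is usually measured by comparing a metric's scores against human ratings of the same dialogues. The sketch below computes Spearman's rank correlation in plain Python as one common way to do this; this summary does not say which correlation statistic the paper reports, and the scores and ratings shown are made-up illustrative values, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: one metric score and one human rating per dialogue.
metric_scores = [0.91, 0.15, 0.62, 0.40, 0.78]
human_ratings = [5, 1, 3, 3, 4]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

A value near 1 means the metric orders dialogues much as the human raters do; a value near 0 means the two orderings are unrelated.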

Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue. There have been recent efforts to develop automatic dialogue evaluation metrics, but most of them do not generalize to unseen datasets and/or need a human-generated reference response during inference, making them infeasible for online evaluation. Here, we propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances, and leverages the temporal transitions that exist between them. We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
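To make the "unreferenced" idea concrete, here is a minimal toy sketch of the interface such a metric exposes: it scores a candidate response from the dialogue context alone, with no reference response as input. This is not the paper's model. The hashed bag-of-words `encode` function is a stand-in assumption for a pre-trained language model, the exponential-decay fold in `summarize_context` is a stand-in for a learned model of transitions between utterances, and all names and parameters (`DIM`, `decay`, `score`) are hypothetical.

```python
import hashlib
import math

DIM = 16  # toy embedding size (assumption, chosen only for illustration)


def encode(utterance):
    """Toy stand-in for a pre-trained LM encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in utterance.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def summarize_context(utterances, decay=0.5):
    """Fold utterance vectors in order so that later turns weigh more.

    A crude, non-learned stand-in for modelling the temporal transitions
    between utterances.
    """
    state = [0.0] * DIM
    for utterance in utterances:
        vec = encode(utterance)
        state = [decay * s + (1 - decay) * v for s, v in zip(state, vec)]
    return state


def score(context, response):
    """Score a response given only the context; no reference response is used."""
    context_state = summarize_context(context)
    response_vec = encode(response)
    logit = sum(c * r for c, r in zip(context_state, response_vec))
    return 1.0 / (1.0 + math.exp(-logit))  # squash to (0, 1)


context = ["hello there", "how are you"]
print(score(context, "i am fine thanks"))
```

Because `score` takes only the context and the candidate response, it can be called on live conversations where no human-written reference exists, which is the property the abstract describes as suitable for online evaluation. A real metric would replace both stand-ins with trained components.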

Code Implementations: 1 repo