CL · LG · May 1, 2020

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

arXiv:2005.00456v1 · 1053 citations
Originality: Highly original
AI Analysis

This paper addresses the lack of effective automatic evaluation metrics for open-domain dialog, which has been a bottleneck for research in the field.

The paper proposes USR, an unsupervised, reference-free metric for evaluating dialog generation. USR correlates strongly with human judgment: turn-level correlations of 0.42 on Topical-Chat and 0.48 on PersonaChat, and system-level correlations of 1.0 on both.

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48 and system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
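The abstract reports correlations at two granularities, turn-level and system-level. The sketch below illustrates the difference under stated assumptions: the scores are invented toy data, the names `pearson`, `turn_level`, `system_level`, and `toy_records` are hypothetical, and Pearson correlation is used only for illustration (the summary above does not say which correlation coefficient the paper reports). It is not the paper's evaluation code.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def turn_level(records):
    """Correlate metric and human scores over every individual response."""
    metric = [m for _, m, _ in records]
    human = [h for _, _, h in records]
    return pearson(metric, human)


def system_level(records):
    """Average scores per dialog system, then correlate the averages."""
    by_system = defaultdict(list)
    for system, m, h in records:
        by_system[system].append((m, h))
    metric_means = []
    human_means = []
    for scores in by_system.values():
        metric_means.append(sum(m for m, _ in scores) / len(scores))
        human_means.append(sum(h for _, h in scores) / len(scores))
    return pearson(metric_means, human_means)


# Toy data (invented): (system name, automatic metric score, human rating).
toy_records = [
    ("A", 0.2, 2), ("A", 0.4, 1),
    ("B", 0.5, 3), ("B", 0.7, 2),
    ("C", 0.8, 5), ("C", 0.9, 3),
]

turn_r = turn_level(toy_records)
system_r = system_level(toy_records)
print(f"turn-level r = {turn_r:.2f}, system-level r = {system_r:.2f}")
```

Averaging per system smooths out per-response noise, which is one reason a metric can rank systems well even when its agreement with humans on individual responses is only moderate.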

Code Implementations: 1 repo
