Open-Domain Dialog Evaluation using Follow-Ups Likelihood
This work addresses the challenge of reliable automated dialog evaluation for researchers and developers, though it appears incremental because it builds on existing language model techniques.
The paper tackles the problem of automatic evaluation for open-domain dialogs by introducing a method that measures the likelihood of a fixed set of follow-ups under a language model, achieving a higher correlation with human evaluations than twelve existing methods.
Automatic evaluation of open-domain dialogs remains an unsolved problem. Moreover, existing methods do not correlate strongly with human annotations. This paper presents a new automated evaluation method using follow-ups: we measure the probability that a language model will continue the conversation with a fixed set of follow-ups (e.g., "not really relevant here", "what are you trying to say"). When compared against twelve existing methods, our new evaluation achieves the highest correlation with human evaluations.
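
The follow-up idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `token_logprob` interface, the whitespace tokenization, and the toy model `toy_token_logprob` are hypothetical stand-ins for a real language model. The sketch also assumes that the two example follow-ups signal a conversational breakdown, so a higher likelihood is read as lower dialog quality and the mean log-likelihood is negated.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: returns log P(token | preceding tokens).
TokenLogProb = Callable[[Sequence[str], str], float]

# The two example follow-ups quoted in the abstract.
FOLLOW_UPS = ["not really relevant here", "what are you trying to say"]


def follow_up_log_likelihood(
    context: str, follow_up: str, token_logprob: TokenLogProb
) -> float:
    """Sum of token log-probabilities of the follow-up given the dialog."""
    history: List[str] = context.split()  # assumption: whitespace tokens
    total = 0.0
    for token in follow_up.split():
        total += token_logprob(history, token)
        history.append(token)  # condition later tokens on earlier ones
    return total


def dialog_score(
    context: str, follow_ups: Sequence[str], token_logprob: TokenLogProb
) -> float:
    """Negated mean follow-up log-likelihood (assumed: higher is better)."""
    mean_ll = sum(
        follow_up_log_likelihood(context, f, token_logprob) for f in follow_ups
    ) / len(follow_ups)
    return -mean_ll


def toy_token_logprob(history: Sequence[str], token: str) -> float:
    """Toy stand-in for a language model, for illustration only.

    It makes every follow-up token more probable when the dialog
    contains an obviously off-topic word. It is not a normalized
    distribution.
    """
    off_topic = {"purple", "banana"}
    p = 0.3 if off_topic.intersection(history) else 0.05
    return math.log(p)


coherent = "A: do you like jazz ? B: yes , I love Miles Davis ."
incoherent = "A: do you like jazz ? B: purple banana ."

coherent_score = dialog_score(coherent, FOLLOW_UPS, toy_token_logprob)
incoherent_score = dialog_score(incoherent, FOLLOW_UPS, toy_token_logprob)
```

In practice `toy_token_logprob` would be replaced by per-token log-probabilities from a pretrained dialog language model. With the toy model, the coherent dialog receives the higher score because the breakdown follow-ups are less likely after it.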