CL · Apr 29, 2020

Evaluating Dialogue Generation Systems via Response Selection

Shiki Sato, Reina Akama, Hiroki Ouchi, Jun Suzuki, Kentaro Inui

arXiv:2004.14302v131.11001 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of evaluating dialogue generation systems, but the advance is incremental because it builds on existing response selection approaches.

The paper tackles the poor correlation between automatic metrics and human evaluation in open-domain dialogue generation. It proposes a method for constructing response selection test sets with carefully filtered false candidates, and evaluation on these test sets correlates more strongly with human evaluation than metrics such as BLEU do.

Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method for constructing response selection test sets with well-chosen false candidates. Specifically, we propose constructing test sets by filtering out some types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection on test sets developed by our method correlates more strongly with human evaluation than widely used automatic evaluation metrics such as BLEU.
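The two filtering criteria in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how relatedness or acceptability is judged, so the word-overlap similarity, the `low` and `high` thresholds, and the function names `jaccard`, `filter_false_candidates`, and `selection_accuracy` are all assumptions made for illustration. In particular, the upper threshold is only a crude stand-in for deciding that a candidate would itself be an acceptable response.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; an assumed proxy for 'relatedness'."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def filter_false_candidates(
    ground_truth: str,
    pool: Sequence[str],
    low: float = 0.2,   # assumed: below this, the candidate is "unrelated" (type i)
    high: float = 0.8,  # assumed: at or above this, it may be "acceptable" (type ii)
) -> List[str]:
    """Keep only candidates that are related to the ground truth but not near-duplicates."""
    return [c for c in pool if low <= jaccard(ground_truth, c) < high]


def selection_accuracy(
    score: Callable[[str, str], float],
    test_set: Sequence[Tuple[str, str, Sequence[str]]],
) -> float:
    """Fraction of items where the system scores the ground truth strictly above every false candidate."""
    correct = 0
    for context, truth, false_candidates in test_set:
        best_false = max(
            (score(context, c) for c in false_candidates),
            default=float("-inf"),
        )
        if score(context, truth) > best_false:
            correct += 1
    return correct / len(test_set)


# Toy data, invented for this example.
truth = "i love playing tennis on weekends"
pool = [
    "i love playing tennis on weekends too",  # near-duplicate: dropped
    "the stock market fell today",            # unrelated: dropped
    "i love playing chess on sundays",        # related but different: kept
]
false_candidates = filter_false_candidates(truth, pool)
test_set = [("do you play any sports", truth, false_candidates)]
```

A generation system would be plugged in through `score`, for example by returning the likelihood the system assigns to a response given the context. The resulting accuracy is the quantity whose correlation with human evaluation the abstract refers to.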
