CL | Apr 29, 2020

Evaluating Dialogue Generation Systems via Response Selection

arXiv:2004.14302v11001 citations
AI Analysis

This work addresses the evaluation challenge for dialogue generation systems, but it is incremental as it builds on existing response selection approaches.

The paper tackles the poor correlation between automatic metrics and human evaluation in open-domain dialogue generation. It proposes a method for constructing response selection test sets with carefully filtered false candidates, and reports that evaluation on these test sets correlates more strongly with human evaluation than metrics such as BLEU.

Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method to construct response selection test sets with well-chosen false candidates. Specifically, we propose to construct test sets by filtering out two types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with the test sets developed by our method correlates more strongly with human evaluation than widely used automatic evaluation metrics such as BLEU.
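The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-level Jaccard overlap, the `low` and `high` thresholds, and the function names `filter_false_candidates` and `selects_ground_truth` are all assumptions made for illustration. In particular, the upper threshold is only a crude proxy for criterion (ii), since deciding whether a candidate is an acceptable response may in practice require human judgment.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap (an assumed stand-in for relatedness)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def filter_false_candidates(
    ground_truth: str,
    pool: Sequence[str],
    low: float = 0.2,   # hypothetical threshold: below this, "unrelated"
    high: float = 0.8,  # hypothetical threshold: at or above, "too close"
) -> List[str]:
    """Keep candidates related to the ground truth but not near-duplicates.

    (i)  similarity < low   -> unrelated to the ground truth, dropped
    (ii) similarity >= high -> may itself be acceptable, dropped
    """
    kept = []
    for cand in pool:
        sim = jaccard(ground_truth, cand)
        if low <= sim < high:
            kept.append(cand)
    return kept


def selects_ground_truth(
    score: Callable[[str, str], float],
    context: str,
    ground_truth: str,
    false_candidates: Sequence[str],
) -> bool:
    """True if the system scores the ground truth strictly above every false candidate."""
    gt_score = score(context, ground_truth)
    return all(gt_score > score(context, c) for c in false_candidates)


# Toy data, invented for this example.
ground_truth = "i love playing tennis on weekends"
pool = [
    "i love playing tennis on the weekends",  # near-duplicate, dropped by (ii)
    "the stock market fell today",            # unrelated, dropped by (i)
    "i love playing chess on weekdays",       # related but wrong, kept
]
false_candidates = filter_false_candidates(ground_truth, pool)
```

Here `false_candidates` contains only the chess response. A system under evaluation would supply its own `score(context, response)` function, and its accuracy over many such test instances would then be compared against human ratings.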

Code Implementations: 1 repo