$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
The work addresses the unreliability of neural dialogue models in applications that require factual accuracy, though it is an incremental improvement over existing evaluation methods.
The paper tackles factual inconsistency in knowledge-grounded dialogue systems by proposing $Q^2$, an automatic evaluation metric that combines question generation and question answering with natural language inference. In the reported meta-evaluation, $Q^2$ correlates more strongly with human judgments than the other metrics it is compared against.
Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted $Q^2$, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of $Q^2$ against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.
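The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function `q2_style_score`, its parameters, the binary per-span scoring, and the premise/hypothesis construction are all assumptions made for illustration. The three model components (question generation, question answering, NLI) are passed in as plain callables and replaced here by toy stand-ins so the example runs without any external models.

```python
from typing import Callable, Dict, List


def q2_style_score(
    response: str,
    response_spans: List[str],
    knowledge: str,
    generate_question: Callable[[str, str], str],
    answer_question: Callable[[str, str], str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of response spans whose knowledge-grounded answer entails them.

    Hypothetical sketch: binary per-span scores averaged over all spans.
    """
    if not response_spans:
        # Arbitrary choice for this sketch: nothing to check yields 0.0.
        return 0.0
    scores = []
    for span in response_spans:
        # 1. Generate a question whose answer in the response is `span`.
        question = generate_question(response, span)
        # 2. Answer the same question using the grounding knowledge.
        knowledge_answer = answer_question(question, knowledge)
        # 3. Compare the two answers with NLI rather than token overlap.
        premise = f"{question} {knowledge_answer}"
        hypothesis = f"{question} {span}"
        scores.append(1.0 if entails(premise, hypothesis) else 0.0)
    return sum(scores) / len(scores)


# Toy stand-ins for the three models (assumptions, not real models).
QUESTIONS: Dict[str, str] = {
    "Brazil": "Where did coffee originate?",
    "roasted beans": "What is coffee made from?",
}
ANSWERS: Dict[str, str] = {
    "Where did coffee originate?": "Ethiopia",
    "What is coffee made from?": "roasted coffee beans",
}


def toy_generate_question(response: str, span: str) -> str:
    return QUESTIONS[span]


def toy_answer_question(question: str, knowledge: str) -> str:
    return ANSWERS[question]


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Crude proxy for NLI: every hypothesis token appears in the premise.
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


knowledge = "Coffee is brewed from roasted coffee beans. It originated in Ethiopia."
response = "Coffee originated in Brazil and is made from roasted beans."
score = q2_style_score(
    response,
    ["Brazil", "roasted beans"],
    knowledge,
    toy_generate_question,
    toy_answer_question,
    toy_entails,
)
print(score)
```

In this toy run, "roasted beans" is supported by the knowledge answer "roasted coffee beans" even though the strings differ, while "Brazil" is not supported by "Ethiopia". That is the kind of case where an inference-based comparison can behave differently from exact token matching.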