CL · Jun 5, 2021

Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation

arXiv:2106.02833v1714 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of robust automated evaluation for open-domain dialog systems. It is an incremental advance that builds on prior multi-reference evaluation methods to reduce reliance on costly human annotations.

The paper tackles the expense and poor scalability of collecting human-written references for automated evaluation of open-domain dialog. It proposes a technique that automatically expands a single human-generated reference into multiple plausible candidates, drawing on a commonsense knowledge base and on retrieval from a dialog corpus. With the expanded reference sets, automated metrics correlate substantially better with human ratings of system outputs on the DailyDialog dataset.
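To make "correlation of automated metrics with human ratings" concrete, here is a minimal sketch of how such a comparison might be computed. Pearson correlation is used as one common choice; the summary above does not say which correlation statistic the authors report. All numbers below are invented toy values for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical toy data: one entry per system response (NOT from the paper).
human_ratings = [1, 2, 3, 4, 5]
metric_single_ref = [0.2, 0.2, 0.2, 0.4, 0.3]   # metric scored against one reference
metric_expanded_ref = [0.1, 0.3, 0.5, 0.6, 0.9]  # same metric, expanded reference set

corr_single = pearson(human_ratings, metric_single_ref)
corr_expanded = pearson(human_ratings, metric_expanded_ref)
print(f"single reference:   {corr_single:.3f}")
print(f"expanded references: {corr_expanded:.3f}")
```

A higher correlation for the expanded set would indicate that the metric ranks responses more like human raters do, which is the kind of improvement the paper reports.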

Multiple different responses are often plausible for a given open domain dialog context. Prior work has shown the importance of having multiple valid reference responses for meaningful and robust automated evaluations. In such cases, common practice has been to collect more human-written references. However, such collection can be expensive, time-consuming, and not easily scalable. Instead, we propose a novel technique for automatically expanding a human-generated reference to a set of candidate references. We fetch plausible references from knowledge sources, and adapt them so that they are more fluent in the context of the dialog instance in question. More specifically, we use (1) a commonsense knowledge base to elicit a large number of plausible reactions given the dialog history, and (2) relevant instances retrieved from a dialog corpus, using similar past as well as future contexts. We demonstrate that our automatically expanded reference sets lead to large improvements in correlations of automated metrics with human ratings of system outputs on the DailyDialog dataset.
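The retrieval-based part of the idea can be sketched as follows. This is an illustrative simplification, not the authors' implementation. It assumes Jaccard word overlap as the context-similarity measure and unigram F1 as a stand-in for the automated metric, and it covers only retrieval by past context. The commonsense knowledge base, the use of future contexts, and the adaptation step are omitted. All function names and the toy corpus are hypothetical.

```python
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization; a deliberately crude stand-in."""
    return text.lower().split()


def unigram_f1(hypothesis, reference):
    """Unigram-overlap F1 (stand-in for a real metric such as BLEU or METEOR)."""
    hyp, ref = Counter(tokens(hypothesis)), Counter(tokens(reference))
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve_references(context, corpus, k=2):
    """Return the responses of the k corpus entries with the most similar context."""
    ranked = sorted(corpus, key=lambda pair: jaccard(context, pair[0]), reverse=True)
    return [response for _, response in ranked[:k]]


def multi_reference_score(hypothesis, references):
    """Score a hypothesis against its best-matching reference."""
    return max(unigram_f1(hypothesis, ref) for ref in references)


# Hypothetical toy corpus of (context, response) pairs.
corpus = [
    ("do you want to grab lunch today", "sure , where should we go"),
    ("what time is the meeting", "it starts at three"),
    ("want to get lunch later", "sorry , i already ate"),
]
context = "do you want to get lunch"
human_reference = "yes , i am starving"
system_output = "sorry , i already ate"

expanded = [human_reference] + retrieve_references(context, corpus, k=2)
single_score = multi_reference_score(system_output, [human_reference])
expanded_score = multi_reference_score(system_output, expanded)
print(single_score, expanded_score)
```

In this toy case the system output is a perfectly reasonable reply that shares little with the single human reference, so it scores low against it. Once a retrieved response with a similar context joins the reference set, the same output scores much higher, which is the effect the expanded sets are meant to capture.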

Code Implementations: 1 repo
