CL Oct 25, 2022

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

Chen Zhang, Luis Fernando D'Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li

arXiv:2210.13832v224.2294 citations h-index: 25 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more comprehensive dialogue evaluation metrics among researchers and developers in conversational AI. It is an incremental advance, since it builds on existing model-based metrics.

The paper tackles open-domain dialogue evaluation by proposing a dialogue-level metric that covers multiple quality dimensions. The combined metric achieves around 16% relative improvement on average over the existing state-of-the-art metric across three dialogue-level evaluation benchmarks.

Recent model-based reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment. However, they either perform turn-level evaluation or look at a single dialogue quality dimension. One would expect a good evaluation metric to assess multiple quality dimensions at the dialogue level. To this end, we are motivated to propose a multi-dimensional dialogue-level metric, which consists of three sub-metrics with each targeting a specific dimension. The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions. Moreover, we explore two approaches to combine the sub-metrics: metric ensemble and multitask learning. Both approaches yield a holistic metric that significantly outperforms individual sub-metrics. Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average across three high-quality dialogue-level evaluation benchmarks.
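
The metric-ensemble idea and the way such metrics are usually scored against human judgment can be illustrated with a minimal sketch. It assumes the ensemble is an unweighted average of the three sub-metric scores per dialogue, and that agreement with human ratings is measured with Spearman rank correlation; the abstract specifies neither detail. All function names and the toy numbers are hypothetical and are not taken from the paper or its code.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(rank(x), rank(y))


def ensemble(sub_scores):
    """Average per-dialogue scores across sub-metrics (assumed unweighted)."""
    return [mean(per_dialogue) for per_dialogue in zip(*sub_scores)]


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.16 means 16%."""
    return (new - baseline) / baseline


# Hypothetical toy data: three sub-metrics scoring four dialogues,
# plus made-up human ratings for the same dialogues.
sub_metric_scores = [
    [0.2, 0.7, 0.5, 0.9],  # sub-metric A
    [0.3, 0.6, 0.4, 0.8],  # sub-metric B
    [0.1, 0.8, 0.6, 0.7],  # sub-metric C
]
human_ratings = [1, 4, 3, 5]

combined = ensemble(sub_metric_scores)
correlation = spearman(combined, human_ratings)
```

Under this reading, a "16% relative improvement" means the combined metric's average correlation with human judgment is about 1.16 times that of the baseline metric, which is what `relative_improvement` computes.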
