CL · Oct 5, 2021

Investigating the Impact of Pre-trained Language Models on Dialog Evaluation

Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Thomas Friedrichs, Haizhou Li

arXiv:2110.01895v21.26 citations

Originality: Synthesis-oriented

AI Analysis

This study provides a comprehensive assessment for researchers in dialog systems, though it is incremental as it analyzes existing models rather than proposing new ones.

The paper investigates how different pre-trained language models affect the performance of automatic dialog evaluation metrics. Examining eight models on three benchmarks, it finds that the choice of model significantly impacts results.

Recently, there has been a surge of interest in applying pre-trained language models (Pr-LMs) to automatic open-domain dialog evaluation. Pr-LMs offer a promising direction for addressing the multi-domain evaluation challenge. Yet the impact of different Pr-LMs on the performance of automatic metrics is not well understood. This paper examines eight different Pr-LMs and studies their impact on three typical automatic dialog evaluation metrics across three different dialog evaluation benchmarks. Specifically, we analyze how the choice of Pr-LM affects the performance of automatic metrics. Extensive correlation analyses are performed on each of the metrics to assess the effects of different Pr-LMs along various axes, including pre-training objectives, dialog evaluation criteria, model size, and cross-dataset robustness. This study serves as the first comprehensive assessment of the effects of different Pr-LMs on automatic dialog evaluation.
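
The correlation analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes Spearman rank correlation between a metric's scores and human ratings, which is a common choice for dialog evaluation but is not stated in the abstract. The encoder names and all numbers below are hypothetical toy data, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical human quality ratings for five dialog responses.
human_ratings = [1, 2, 3, 4, 5]

# Hypothetical scores from the same metric built on two different encoders.
metric_scores = {
    "encoder_a": [0.10, 0.35, 0.30, 0.70, 0.90],
    "encoder_b": [0.50, 0.20, 0.60, 0.40, 0.55],
}

for name, scores in metric_scores.items():
    print(name, round(spearman(scores, human_ratings), 3))
```

On this toy data the metric built on `encoder_a` tracks the human ratings more closely than the one built on `encoder_b`, which is the kind of comparison such an analysis would repeat for each metric, model, and benchmark.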
