One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning
This work addresses the need for flexible, multi-lingual evaluation metrics for dialogue systems, though the contribution is incremental, since it builds on existing neural network-based approaches.
The paper tackles the problem of automatic evaluation for open-domain dialogue systems across multiple languages by proposing an adversarial multi-task neural metric (ADVMT) with shared feature extraction. Experiments in two languages show that it achieves a high correlation with human annotation, outperforming both monolingual metrics and various existing metrics.
Automatically evaluating the performance of open-domain dialogue systems is a challenging problem. Recent work on neural network-based metrics has shown promise for automatic dialogue evaluation. However, existing methods mainly focus on monolingual evaluation, so the trained metric is not flexible enough to transfer across languages. To address this issue, we propose an adversarial multi-task neural metric (ADVMT) for multi-lingual dialogue evaluation, with shared feature extraction across languages. We evaluate the proposed model in two different languages. Experiments show that the adversarial multi-task neural metric achieves a high correlation with human annotation and outperforms both monolingual metrics and various existing metrics.
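
The claim of "correlation with human annotation" can be made concrete with a small sketch. The text above does not say which correlation coefficient is used, so the code below assumes the two commonly reported ones, Pearson and Spearman. The function names and the example scores are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: one automatic score and one human rating per response.
metric_scores = [0.91, 0.35, 0.62, 0.18, 0.77]
human_ratings = [5, 2, 3, 1, 4]

print("Pearson :", pearson(metric_scores, human_ratings))
print("Spearman:", spearman(metric_scores, human_ratings))
```

Under this reading, a metric is judged by how closely its scores track human ratings on the same set of responses, and comparing metrics amounts to comparing these coefficients per language.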