CL · Jun 11, 2020

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

arXiv:2006.06264v2 · 1052 citations
AI Analysis

This work addresses the reliability of evaluation protocols for machine translation metrics, which is crucial for researchers and practitioners in NLP, though it is incremental in refining existing methods.

The paper tackles the problem of evaluating automatic machine translation metrics by showing that current methods are highly sensitive to translation outliers, leading to false confidence, and develops a method for thresholding performance improvements to quantify errors in system ranking.

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
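The outlier sensitivity described in the abstract can be illustrated with a small sketch. The scores below are invented for illustration and are not results from the paper. The outlier rule (median absolute deviation with a 2.5 cutoff and the usual 1.4826 scaling constant) is one common robust choice assumed here, not a procedure stated in the text above. The helper names `pearson` and `mad_outliers` are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mad_outliers(scores, cutoff=2.5):
    """Flag scores far from the median, measured in scaled MAD units.

    Assumption: cutoff=2.5 and the 1.4826 scaling are conventional
    defaults chosen for this sketch.
    """
    med = median(scores)
    mad = 1.4826 * median([abs(s - med) for s in scores])
    return [abs(s - med) / mad > cutoff for s in scores]


# Invented system-level scores: five systems of similar quality plus one
# system that humans rate far worse than the rest.
human = [0.10, 0.12, 0.08, 0.11, 0.09, -1.50]
metric = [30.0, 28.5, 30.5, 29.0, 31.0, 12.0]

flags = mad_outliers(human)
r_all = pearson(human, metric)

# Recompute the correlation on the systems that were not flagged.
kept = [(h, m) for h, m, bad in zip(human, metric, flags) if not bad]
r_trimmed = pearson([h for h, _ in kept], [m for _, m in kept])

print("outlier flags:", flags)
print("Pearson r, all systems:     ", round(r_all, 3))
print("Pearson r, outliers removed:", round(r_trimmed, 3))
```

In this toy data the single weak system drives a high correlation on its own. Among the remaining, closely matched systems the metric actually orders them in the opposite direction to the human scores, which is the kind of falsely confident conclusion the abstract warns about.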
