CL, AI | Mar 11, 2025

Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation

José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins

arXiv:2503.08327v2 | 10.96 citations | h-index: 18 | TACL

Originality: Incremental advance

AI Analysis

This work addresses a critical issue for machine translation practitioners by mitigating overoptimism in evaluation, though the advance is incremental because it builds on existing metric analysis.

The paper tackles the problem of metric interference (MINT) in machine translation, where using the same or related metrics for tuning and evaluation leads to overoptimistic performance estimates that lose correlation with human judgments. It proposes MINTADJUST, which on the WMT24 test set ranks translations and systems more accurately than state-of-the-art metrics across most language pairs, outperforming AUTORANK.

As automatic metrics become stronger and more widely adopted, the risk of unintentionally "gaming the metric" during model development rises. This issue is caused by metric interference (MINT), i.e., the use of the same or related metrics for both model tuning and evaluation. MINT can misguide practitioners into being overoptimistic about the performance of their systems: as system outputs become a function of the interfering metric, their estimated quality loses correlation with human judgments. In this work, we analyze two common cases of MINT in machine translation-related tasks: filtering of training data, and decoding with quality signals. Importantly, we find that MINT strongly distorts instance-level metric scores, even when metrics are not directly optimized for. This calls into question the common strategy of evaluating with a different, yet related, metric that was not used for tuning. To address this problem, we propose MINTADJUST, a method for more reliable evaluation under MINT. On the WMT24 MT shared task test set, MINTADJUST ranks translations and systems more accurately than state-of-the-art metrics across a majority of language pairs, especially for high-quality systems. Furthermore, MINTADJUST outperforms AUTORANK, the ensembling method used by the organizers.
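
The interference effect described in the abstract can be illustrated with a small synthetic simulation. This is only a toy sketch and not the paper's experimental setup or the MINTADJUST method: the Gaussian quality model, the three hypothetical metrics ("tuned", "related", "independent"), and every parameter value are assumptions made for illustration. Candidates are reranked with the "tuned" metric, and the sketch measures how far each metric's score on the selected outputs exceeds their true quality.

```python
import random
from statistics import mean


def simulate(num_sources=2000, num_candidates=8, seed=0):
    """Toy best-of-N reranking; returns each metric's inflation over true quality.

    All distributions and metric names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    picked = {"quality": [], "tuned": [], "related": [], "independent": []}

    for _ in range(num_sources):
        best = None
        for _ in range(num_candidates):
            quality = rng.gauss(0, 1)        # unobserved "human" quality
            shared_error = rng.gauss(0, 1)   # error common to related metrics
            candidate = {
                "quality": quality,
                # Metric used for selection (the interfering metric).
                "tuned": quality + shared_error + rng.gauss(0, 0.5),
                # Different metric that shares part of the tuned metric's error.
                "related": quality + shared_error + rng.gauss(0, 0.5),
                # Metric whose error is independent of the tuned metric.
                "independent": quality + rng.gauss(0, 1),
            }
            # Rerank: keep the candidate the tuned metric scores highest.
            if best is None or candidate["tuned"] > best["tuned"]:
                best = candidate
        for name in picked:
            picked[name].append(best[name])

    means = {name: mean(values) for name, values in picked.items()}
    # Inflation: average metric score minus average true quality of the picks.
    return {
        name: means[name] - means["quality"]
        for name in ("tuned", "related", "independent")
    }


if __name__ == "__main__":
    for name, inflation in simulate().items():
        print(f"{name:12s} inflation: {inflation:+.3f}")
```

Under these assumptions, the tuned metric overstates the quality of the selected outputs the most, the related metric is inflated almost as much even though it was never used for selection, and the independent metric stays close to unbiased. This mirrors the abstract's warning about evaluating with a different but related metric.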
