CL, AI, LG | Sep 24, 2025

Feeding Two Birds or Favoring One? Adequacy-Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation

Behzad Shayegh, Jan-Thorsten Peter, David Vilar, Tobias Domhan, Juraj Juraska, Markus Freitag, Lili Mou

arXiv:2509.20287v1 | h-index: 12 | Proceedings of the Tenth Conference on Machine Translation

Originality: Incremental advance

AI Analysis

This work identifies a bias toward adequacy in machine translation evaluation that affects metric rankings, and offers an incremental improvement to meta-evaluation methodology.

The paper investigates the tradeoff between adequacy and fluency in machine translation evaluation, showing that current metrics and standard meta-evaluation favor adequacy over fluency, and proposes a method to control this bias by synthesizing translation systems.

We investigate the tradeoff between adequacy and fluency in machine translation. We show the severity of this tradeoff at the evaluation level and analyze where popular metrics fall within it. Essentially, current metrics generally lean toward adequacy, meaning that their scores correlate more strongly with the adequacy of translations than with fluency. More importantly, we find that this tradeoff also persists at the meta-evaluation level, and that the standard WMT meta-evaluation favors adequacy-oriented metrics over fluency-oriented ones. We show that this bias is partially attributable to the composition of the systems included in the meta-evaluation datasets. To control this bias, we propose a method that synthesizes translation systems in meta-evaluation. Our findings highlight the importance of understanding this tradeoff in meta-evaluation and its impact on metric rankings.
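
To make the phrase "lean toward adequacy" concrete, the sketch below compares how strongly a metric's scores correlate with adequacy judgments versus fluency judgments. It is an illustration only, not the procedure used in the paper: the choice of Pearson correlation, the `adequacy_lean` helper, and all score values are assumptions made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def adequacy_lean(metric, adequacy, fluency):
    """Hypothetical helper: correlation with adequacy minus correlation with fluency.

    A positive value means the metric tracks adequacy more closely than
    fluency; a negative value means the reverse.
    """
    return pearson(metric, adequacy) - pearson(metric, fluency)


# Made-up per-translation scores, for illustration only (not data from the paper).
adequacy = [0.9, 0.4, 0.7, 0.2, 0.8]
fluency = [0.5, 0.9, 0.6, 0.8, 0.3]
metric = [0.85, 0.45, 0.65, 0.30, 0.75]

lean = adequacy_lean(metric, adequacy, fluency)
print(f"adequacy lean: {lean:+.3f}")
```

With these invented numbers the metric follows the adequacy scores closely, so the lean comes out positive. A fluency-oriented metric would yield a negative value under the same calculation.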
