CL · Jul 17, 2025

TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

Richard Sproat, Tianyu Zhao, Llion Jones

arXiv:2507.12724v14.91 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable and automated translation evaluation tools for researchers and practitioners in machine translation, though it is incremental as it builds on existing prompting and ranking methods.

The authors address the evaluation and ranking of machine translations with TransEvalnia, a prompting-based system that uses reasoning to produce fine-grained evaluations and numerical scores. It performs as well as or better than the state-of-the-art MT-Ranker on multiple language pairs; human raters deemed its evaluations highly acceptable, and its scores correlate well with human assessments.

We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system -- as well as MT-Ranker -- to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system's evaluations and reasoning and the human assessments, as well as the code, are released.
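The abstract does not say which methods the authors propose for position bias. As a rough illustration of the general idea only, the sketch below uses a common generic mitigation: present the two translations in both orders and average the scores per translation. The names `PairJudge`, `rank_both_orders`, and `toy_judge` are hypothetical and are not taken from the TransEvalnia code; the toy judge is a stand-in for an LLM call and simply adds a bonus to whichever translation is shown first.

```python
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not the paper's API): takes
# (source, first, second) and returns one score per translation, in the
# order in which the translations were presented.
PairJudge = Callable[[str, str, str], Tuple[float, float]]


def rank_both_orders(judge: PairJudge, source: str, a: str, b: str) -> dict:
    """Score (a, b) and (b, a), then average the scores per translation."""
    a1, b1 = judge(source, a, b)  # a presented first
    b2, a2 = judge(source, b, a)  # b presented first
    score_a = (a1 + a2) / 2
    score_b = (b1 + b2) / 2
    # Flag cases where the two presentations disagree on whether a beats b.
    order_sensitive = (a1 > b1) != (a2 > b2)
    if score_a == score_b:
        best = "tie"
    else:
        best = "a" if score_a > score_b else "b"
    return {
        "score_a": score_a,
        "score_b": score_b,
        "best": best,
        "order_sensitive": order_sensitive,
    }


def toy_judge(source: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in for an LLM judge with a built-in first-position bonus."""
    quality = {"okay": 3.5, "good": 4.0}  # made-up scores for illustration
    return quality[first] + 1.0, quality[second]


result = rank_both_orders(toy_judge, "source text", "okay", "good")
print(result)
```

In this toy setup the biased judge prefers "okay" when it is shown first, but averaging over both orders ranks "good" higher and flags the pair as order-sensitive. Averaging is only one option; a real system might instead treat order-sensitive pairs as ties or score each translation independently.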
