CL · Sep 23, 2025

Evaluating Language Translation Models by Playing Telephone

arXiv:2509.19611v1 · 1 citation · h-index: 3 · EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses a bottleneck in evaluating translation models that affects both researchers and developers, though the advance is incremental because it builds on existing evaluation systems.

The paper tackles machine translation quality evaluation by proposing an unsupervised method that generates training data through repeated rounds of translation. Evaluation systems trained on this data outperform xCOMET at scoring translation quality and at selecting which of two translations is generationally closer to the source.

Our ability to efficiently and accurately evaluate the quality of machine translation systems has been outrun by the effectiveness of current language models, which limits the potential for further improving these models on more challenging tasks like long-form and literary translation. We propose an unsupervised method to generate training data for translation evaluation over different document lengths and application domains by repeated rounds of translation between source and target languages. We evaluate evaluation systems trained on texts mechanically generated using both model rotation and language translation approaches, demonstrating improved performance over a popular translation evaluation system (xCOMET) on two different tasks: (i) scoring the quality of a given translation against a human reference and (ii) selecting which of two translations is generationally closer to an original source document.
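
The abstract does not spell out the data-generation pipeline, so the following is only a minimal sketch of the "telephone" idea under stated assumptions: each round translates the text into the target language and back, and an earlier generation is assumed to be closer to the original than a later one. The names `telephone_chain`, `preference_pairs`, and `toy_translate` are hypothetical, and the toy translator (which drops one word per call) merely stands in for a real translation model so the example runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical translator signature: (text, src_lang, tgt_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def telephone_chain(source: str, translate: Translator,
                    src_lang: str, tgt_lang: str, rounds: int) -> List[str]:
    """Return [generation 0, generation 1, ...], all in src_lang.

    Generation 0 is the original; each later generation is one
    round trip (src -> tgt -> src) applied to the previous one.
    """
    generations = [source]
    current = source
    for _ in range(rounds):
        forward = translate(current, src_lang, tgt_lang)
        current = translate(forward, tgt_lang, src_lang)
        generations.append(current)
    return generations


def preference_pairs(generations: List[str]) -> List[Tuple[str, str]]:
    """Build (closer, farther) pairs from the translated generations.

    Assumption: a text with fewer round trips is closer to the original.
    Generation 0 (the original itself) is excluded from the pairs.
    """
    pairs = []
    for i in range(1, len(generations)):
        for j in range(i + 1, len(generations)):
            pairs.append((generations[i], generations[j]))
    return pairs


def toy_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    """Stand-in for a real model: loses the last word on every call."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


if __name__ == "__main__":
    gens = telephone_chain("a b c d e", toy_translate, "en", "de", rounds=2)
    for closer, farther in preference_pairs(gens):
        print(repr(closer), "is assumed closer than", repr(farther))
```

In a real setting `toy_translate` would be replaced by calls to one or more translation models, and the resulting pairs could serve as unsupervised training signal for an evaluator; whether the paper constructs its pairs in exactly this way is not stated in the abstract.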

