Consistent Human Evaluation of Machine Translation across Language Pairs
This work addresses unreliable human quality assessment of machine translation systems, a problem that matters to both researchers and developers. The contribution is incremental in that it builds on existing evaluation methods.
The paper tackles inconsistent human evaluation of machine translation across language pairs. It proposes a new metric, XSTS, that focuses on semantic equivalence, together with a cross-lingual calibration method, and demonstrates improved consistency in large-scale studies covering up to 14 language pairs.
Obtaining meaningful quality scores for machine translation systems through human evaluation remains a challenge given the high variability between human evaluators, partly due to subjective expectations for translation quality for different language pairs. We propose a new metric called XSTS that is more focused on semantic equivalence and a cross-lingual calibration method that enables more consistent assessment. We demonstrate the effectiveness of these novel contributions in large-scale evaluation studies across up to 14 language pairs, with translation both into and out of English.
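
To make the idea of cross-lingual calibration concrete, here is a minimal sketch of one possible approach. It is not the procedure used in the paper, whose details are not given here. The sketch assumes that evaluators for every language pair also score a shared calibration set, that scores lie on a 1 to 5 scale, and that a simple additive offset is enough to align the pairs. The function name `calibrate`, the language-pair labels, and all scores are hypothetical.

```python
from statistics import mean


def calibrate(raw_scores, calibration_scores, lo=1.0, hi=5.0):
    """Shift each language pair's mean score by an additive offset.

    raw_scores: dict mapping a language pair to the scores its evaluators
        gave on that pair's own test items.
    calibration_scores: dict mapping a language pair to the scores the same
        evaluators gave on a shared calibration set (an assumption of this
        sketch).
    lo, hi: bounds of the rating scale (assumed to be 1 to 5).
    """
    # Reference level: the average of the per-pair calibration means.
    reference = mean(mean(s) for s in calibration_scores.values())

    calibrated = {}
    for pair, scores in raw_scores.items():
        # Positive offset for strict evaluators, negative for lenient ones.
        offset = reference - mean(calibration_scores[pair])
        adjusted = mean(scores) + offset
        # Keep the adjusted score inside the rating scale.
        calibrated[pair] = min(hi, max(lo, adjusted))
    return calibrated


# Hypothetical data: one lenient and one strict evaluator group.
calibration_scores = {"eng-fra": [4.0, 4.0, 4.0], "eng-hau": [2.0, 2.0, 2.0]}
raw_scores = {"eng-fra": [4.0, 5.0], "eng-hau": [3.0, 3.0]}
result = calibrate(raw_scores, calibration_scores)
```

In this toy data the lenient group's mean is shifted down and the strict group's mean is shifted up, so the two pairs become comparable on a common scale. An additive offset is only one design choice; a real study might prefer a per-evaluator or rank-based adjustment.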