Evaluating the Efficacy of Length-Controllable Machine Translation
This work provides a systematic evaluation framework for a constrained translation task. The contribution is incremental, but it addresses a specific need in machine translation research.
This paper evaluates automatic metrics for length-controllable machine translation by conducting a rigorous human evaluation on two translation directions and testing 18 metrics, finding that BLEURT and COMET correlate most highly with human judgments.
Length-controllable machine translation is a type of constrained translation. It aims to preserve as much of the original meaning as possible while controlling the length of the translation. Automatic summarization or machine translation evaluation metrics can be applied to length-controllable machine translation, but they are not necessarily suitable or accurate for this task. This work is the first attempt to systematically evaluate automatic metrics for length-controllable machine translation. We conduct a rigorous human evaluation on two translation directions and evaluate 18 summarization or translation evaluation metrics. We find that BLEURT and COMET have the highest correlation with human evaluation and are the most suitable evaluation metrics for length-controllable machine translation.
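As a rough illustration of what "correlation with human evaluation" means in practice, the sketch below ranks candidate metrics by how well their scores track human scores on the same translations. It assumes Pearson correlation, which the abstract does not specify; the names `pearson`, `rank_metrics`, `metric_a`, and `metric_b` are hypothetical, and the numbers are toy values, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def rank_metrics(human, metric_scores):
    """Return (metric name, correlation) pairs, highest correlation first."""
    corr = {name: pearson(scores, human) for name, scores in metric_scores.items()}
    return sorted(corr.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one human score and one score per metric
# for each of five translations. These are NOT values from the paper.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = {
    "metric_a": [0.1, 0.2, 0.3, 0.4, 0.5],  # rises linearly with human scores
    "metric_b": [0.5, 0.1, 0.4, 0.2, 0.3],  # only loosely related
}

ranking = rank_metrics(human, metric_scores)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

A rank-based coefficient such as Spearman or Kendall could be substituted for `pearson` without changing the ranking logic; which coefficient is appropriate depends on how the human scores were collected.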