Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level
This work tackles the evaluation of machine translation at the paragraph level, but its contribution is incremental: paragraph-level metrics show no significant advantage over existing sentence-level metrics.
The study examines how effective automatic evaluation metrics are for paragraph-level machine translation by constructing paragraph-level datasets from existing sentence-level data and benchmarking existing metrics on them. Sentence-level metrics performed as well as metrics trained at the paragraph level, which the authors attribute to properties of reference-based evaluation and to limitations of the datasets.
As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics as well as train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level. We speculate that this result can be attributed to properties of the task of reference-based evaluation as well as limitations of our datasets with respect to capturing all types of phenomena that occur in paragraph-level translations.
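The abstract does not spell out how the paragraph-level data is built, so the following is only a minimal sketch of one plausible construction, not the authors' procedure. It assumes that sentence-level examples arrive in document order, that consecutive sentences of the same document are grouped into fixed-size chunks, and that the paragraph's human score is the mean of its sentence scores. The names `SentenceExample`, `ParagraphExample`, `build_paragraphs`, and `max_sentences` are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class SentenceExample:
    doc_id: str          # hypothetical document identifier
    source: str
    translation: str
    reference: str
    human_score: float   # sentence-level human rating


@dataclass
class ParagraphExample:
    doc_id: str
    source: str
    translation: str
    reference: str
    human_score: float   # ASSUMPTION: mean of the sentence-level ratings


def build_paragraphs(
    examples: List[SentenceExample], max_sentences: int = 3
) -> List[ParagraphExample]:
    """Group consecutive sentences of one document into paragraph examples."""
    paragraphs = []
    # groupby merges only adjacent items, so the input must be in document order.
    for doc_id, group in groupby(examples, key=lambda ex: ex.doc_id):
        sentences = list(group)
        # ASSUMPTION: fixed-size chunks; the last chunk of a document may be shorter.
        for start in range(0, len(sentences), max_sentences):
            chunk = sentences[start:start + max_sentences]
            paragraphs.append(
                ParagraphExample(
                    doc_id=doc_id,
                    source=" ".join(s.source for s in chunk),
                    translation=" ".join(s.translation for s in chunk),
                    reference=" ".join(s.reference for s in chunk),
                    human_score=sum(s.human_score for s in chunk) / len(chunk),
                )
            )
    return paragraphs
```

Under these assumptions, a sentence-level metric can be applied to a paragraph simply by passing it the concatenated `translation` and `reference` strings, which is the setting the abstract compares against metrics trained at the paragraph level.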