CL Sep 13, 2022

Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement

Zhen Yang, Fandong Meng, Yuanmeng Yan, Jie Zhou

Tsinghua

arXiv:2209.05695v10.63 citations | h-index: 49 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a limitation of automatically generated word-level quality estimation labels for machine translation, which often conflict with human judgment, by providing a more reliable dataset and methods for researchers and practitioners.

The authors tackled the problem of word-level quality estimation for machine translation by creating a new dataset based on human judgment instead of post-editing effort, and they proposed self-supervised pre-training strategies to align existing data with this benchmark. The results showed their dataset is more consistent with human judgment and the strategies improved performance on WMT En-De and En-Zh corpora.

Word-level Quality Estimation (QE) of Machine Translation (MT) aims to identify potential translation errors in a translated sentence without a reference. Conventional work on word-level QE is typically designed to predict translation quality in terms of post-editing effort, where the word labels ("OK" and "BAD") are automatically generated by comparing words between MT sentences and the post-edited sentences through a Translation Error Rate (TER) toolkit. While post-editing effort can measure translation quality to some extent, we find it usually conflicts with human judgement on whether a word is well or poorly translated. To overcome this limitation, we first create a golden benchmark dataset, namely HJQE (Human Judgement on Quality Estimation), where expert translators directly annotate the poorly translated words based on their own judgement. Additionally, to make further use of the parallel corpus, we propose self-supervised pre-training with two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE. We conduct extensive experiments on the publicly available WMT En-De and En-Zh corpora. The results not only show that our proposed dataset is more consistent with human judgement but also confirm the effectiveness of the proposed tag correcting strategies. The data can be found at https://github.com/ZhenYangIACAS/HJQE.
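The conventional labelling scheme described in the abstract, where each MT word is tagged "OK" or "BAD" by comparing the MT sentence with its post-edited version, can be illustrated with a minimal sketch. This is not the authors' pipeline or a real TER toolkit: it assumes whitespace tokenization and uses Python's standard `difflib.SequenceMatcher` as a simplified stand-in for TER alignment (no shift operations), and the function name `ter_style_tags` is hypothetical.

```python
from difflib import SequenceMatcher


def ter_style_tags(mt_tokens, pe_tokens):
    """Tag each MT token "OK" if it survives into the post-edit, else "BAD".

    Assumption: difflib's matching blocks approximate a TER alignment.
    A real TER toolkit also models word shifts, which this sketch ignores.
    """
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block marks a run of MT tokens kept unchanged in the post-edit.
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags


mt = "the cat sat in the mat".split()   # machine translation output
pe = "the cat sat on the mat".split()   # human post-edited version
for token, tag in zip(mt, ter_style_tags(mt, pe)):
    print(f"{token}\t{tag}")
```

In this toy example only "in" is tagged "BAD", because it is the only MT word the post-editor changed. Labels produced this way reflect editing effort rather than a translator's direct judgement of each word, which is the mismatch the abstract says HJQE is meant to address.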
