Reassessing Claims of Human Parity and Super-Human Performance in Machine Translation at WMT 2019
This work critically evaluates benchmark claims in machine translation, highlighting methodological flaws for researchers and practitioners.
The paper reassesses the claims of human parity and super-human performance in machine translation made at WMT 2019. It identifies issues in the human evaluation and conducts a modified evaluation whose results refute all of those claims except that of human parity for English-to-German.
We reassess the claims of human parity and super-human performance made at the news shared task of WMT 2019 for three translation directions: English-to-German, English-to-Russian and German-to-English. First, we identify three potential issues in the human evaluation of that shared task: (i) the limited amount of intersentential context available, (ii) the limited translation proficiency of the evaluators and (iii) the use of a reference translation. We then conduct a modified evaluation taking these issues into account. Our results indicate that all the claims of human parity and super-human performance made at WMT 2019 should be refuted, except the claim of human parity for English-to-German. Based on our findings, we put forward a set of recommendations and open questions for future assessments of human parity in machine translation.