Non-Autoregressive Neural Machine Translation: A Call for Clarity
This work addresses evaluation inconsistencies that affect machine translation researchers. Its contribution is incremental, since it focuses on benchmarking rather than on introducing new methods.
The authors tackle inconsistent evaluation in non-autoregressive neural machine translation, a family of models whose translation quality still tends to lag behind that of autoregressive models. They provide standardized BLEU, chrF++, and TER scores, computed with sacreBLEU, on four translation tasks, and report that inconsistent use of tokenized BLEU leads to deviations of up to 1.7 BLEU points.
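To make the tokenization issue concrete, the following is a minimal sketch of how the same hypothesis and reference can receive different BLEU-style scores depending only on how the text is tokenized. It is an illustration under stated assumptions, not the paper's evaluation code: `toy_bleu`, both tokenizers, and the example sentences are hypothetical, and the scoring is a simplified, unsmoothed single-sentence BLEU rather than sacreBLEU's implementation.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(hyp_tokens, ref_tokens, max_n=4):
    """Unsmoothed single-sentence BLEU (illustrative only, not sacreBLEU)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp_tokens, n)
        ref_counts = ngrams(ref_tokens, n)
        total = sum(hyp_counts.values())
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hyp_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def whitespace_tokenize(text):
    """Hypothetical tokenizer A: split on whitespace only."""
    return text.split()


def punctuation_tokenize(text):
    """Hypothetical tokenizer B: also split punctuation into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


hypothesis = "The cat sat on the mat, quietly."
reference = "The cat sat on the mat, quietly and calmly."

score_ws = toy_bleu(whitespace_tokenize(hypothesis), whitespace_tokenize(reference))
score_punct = toy_bleu(punctuation_tokenize(hypothesis), punctuation_tokenize(reference))

print(f"whitespace tokenization:  {score_ws:.3f}")
print(f"punctuation tokenization: {score_punct:.3f}")
```

The two printed scores differ even though the underlying strings are identical, which is the kind of discrepancy that standardized, detokenized scoring is meant to remove.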
Non-autoregressive approaches aim to improve the inference speed of translation models by only requiring a single forward pass to generate the output sequence instead of iteratively producing each predicted token. Consequently, their translation quality still tends to be inferior to their autoregressive counterparts due to several issues involving output token interdependence. In this work, we take a step back and revisit several techniques that have been proposed for improving non-autoregressive translation models and compare their combined translation quality and speed implications under third-party testing environments. We provide novel insights for establishing strong baselines using length prediction or CTC-based architecture variants and contribute standardized BLEU, chrF++, and TER scores using sacreBLEU on four translation tasks, which crucially have been missing as inconsistencies in the use of tokenized BLEU lead to deviations of up to 1.7 BLEU points. Our open-sourced code is integrated into fairseq for reproducibility.
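The abstract mentions CTC-based architecture variants without describing them. As a hedged illustration, CTC-style models generally emit one prediction per output position in a single parallel pass and then collapse that sequence by merging adjacent repeats and dropping a special blank symbol. The sketch below shows only this standard collapsing rule; the function name `ctc_collapse` and the blank symbol `<blank>` are assumptions chosen for the example and are not taken from the paper or from fairseq.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_collapse(position_predictions, blank=BLANK):
    """Apply the standard CTC collapsing rule to per-position predictions.

    Step 1: merge runs of identical adjacent predictions into one.
    Step 2: remove blank symbols.
    """
    merged = [token for token, _ in groupby(position_predictions)]
    return [token for token in merged if token != blank]


# All positions are predicted at once; the output length is decided by collapsing.
predictions = ["the", "the", BLANK, "cat", BLANK, BLANK, "sat", "sat", "sat"]
print(ctc_collapse(predictions))  # ['the', 'cat', 'sat']

# A blank between two identical tokens keeps both of them.
print(ctc_collapse(["very", BLANK, "very", "good"]))  # ['very', 'very', 'good']
```

Because every position is predicted independently of the others, nothing in this step enforces consistency between neighbouring tokens, which relates to the output token interdependence issues the abstract refers to.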