CLJun 10, 2023

Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC

arXiv:2306.06345v31 citationsh-index: 52
Originality Incremental advance
AI Analysis

This work addresses the trade-off between speed and quality in machine translation for applications requiring fast inference, though it is incremental as it builds on existing non-autoregressive methods.

The paper tackles the problem of low translation quality in non-autoregressive translation models by introducing techniques like pretrained language model fine-tuning, embedding distillation, and upsampling, achieving a BLEU score of 39.59 on IWSLT'14 DE→EN and a 16.35x speed improvement over autoregressive models.

Non-autoregressive approaches aim to improve the inference speed of translation models, particularly those that generate output in a one-pass forward manner. However, these approaches often suffer from a significant drop in translation quality compared to autoregressive models. This paper introduces a series of innovative techniques to enhance the translation quality of Non-Autoregressive Translation (NAT) models while maintaining a substantial acceleration in inference speed. We propose fine-tuning Pretrained Multilingual Language Models (PMLMs) with the CTC loss to train NAT models effectively. Furthermore, we adopt the MASK insertion scheme for up-sampling instead of token duplication, and we present an embedding distillation method to further enhance performance. In our experiments, our model outperforms the baseline autoregressive model (Transformer \textit{base}) on multiple datasets, including WMT'14 DE$\leftrightarrow$EN, WMT'16 RO$\leftrightarrow$EN, and IWSLT'14 DE$\leftrightarrow$EN. Notably, our model achieves better performance than the baseline autoregressive model on the IWSLT'14 En$\leftrightarrow$De and WMT'16 En$\leftrightarrow$Ro datasets, even without using distillation data during training. It is worth highlighting that on the IWSLT'14 DE$\rightarrow$EN dataset, our model achieves an impressive BLEU score of 39.59, setting a new state-of-the-art performance. Additionally, our model exhibits a remarkable speed improvement of 16.35 times compared to the autoregressive model.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes