CL | May 1, 2022

Nearest Neighbor Knowledge Distillation for Neural Machine Translation

arXiv:2205.00479v131.9631 citations | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the difficulty of deploying kNN-MT in real-world machine translation applications by reducing decoding cost. The advance is incremental, since it builds directly on kNN-MT.

The paper tackles the high inference cost of k-nearest-neighbor machine translation (kNN-MT) by proposing k Nearest Neighbor Knowledge Distillation (kNN-KD). The method moves the nearest neighbor search to the preprocessing phase and trains the base model to learn the retrieved knowledge, achieving consistent improvements over state-of-the-art baselines while keeping training and decoding as fast as a standard NMT model.

k-nearest-neighbor machine translation (kNN-MT), proposed by Khandelwal et al. (2021), has achieved many state-of-the-art results in machine translation tasks. Although effective, kNN-MT requires a kNN search through a large datastore at every decoding step during inference, which prohibitively increases the decoding cost and makes deployment in real-world applications difficult. In this paper, we propose to move the time-consuming kNN search forward to the preprocessing phase, and then introduce k Nearest Neighbor Knowledge Distillation (kNN-KD), which trains the base NMT model to directly learn the knowledge of kNN. Distilling knowledge retrieved by kNN can encourage the NMT model to take more reasonable target tokens into consideration, thus addressing the overcorrection problem. Extensive experimental results show that the proposed method achieves consistent improvement over state-of-the-art baselines, including kNN-MT, while maintaining the same training and decoding speed as the standard NMT model.
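The idea in the abstract can be illustrated with a minimal sketch, not the authors' implementation. It assumes that, for each target position, the neighbor distances and neighbor target tokens have already been retrieved during preprocessing. The function names, the temperature, and the interpolation weight `alpha` are hypothetical choices made for illustration. The neighbor weighting (a softmax over negative distances, summed per token) follows the usual kNN-MT formulation.

```python
import math


def knn_distribution(distances, neighbor_tokens, vocab_size, temperature=10.0):
    """Build a distribution over the vocabulary from precomputed neighbors.

    Each neighbor contributes exp(-distance / temperature) to the token it
    is paired with; the result is normalized to sum to 1.
    (temperature is an illustrative hyperparameter, not a value from the paper.)
    """
    d_min = min(distances)  # shift distances for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in distances]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for weight, token in zip(weights, neighbor_tokens):
        probs[token] += weight / total
    return probs


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def knn_kd_loss(logits, gold_token, knn_probs, alpha=0.5):
    """Interpolate gold-token cross-entropy with a distillation term.

    The distillation term is the cross-entropy between the precomputed kNN
    distribution (teacher) and the model's prediction (student).
    (alpha and the exact form of the interpolation are assumptions.)
    """
    log_p = log_softmax(logits)
    ce = -log_p[gold_token]
    kd = -sum(q * lp for q, lp in zip(knn_probs, log_p) if q > 0.0)
    return (1.0 - alpha) * ce + alpha * kd


# Hypothetical example: three retrieved neighbors over a 4-token vocabulary.
teacher = knn_distribution([1.0, 1.0, 4.0], [2, 2, 0], vocab_size=4, temperature=1.0)
loss = knn_kd_loss([0.0, 0.0, 0.0, 0.0], gold_token=2, knn_probs=teacher)
```

Because the teacher distribution is computed once before training, no datastore lookup is needed at decoding time in this sketch, which is the property the abstract emphasizes.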
