Learning Light-Weight Translation Models from Deep Transformer
This work provides a method for reducing the computational and memory footprint of NMT models, which benefits users in resource-constrained environments. It represents an incremental improvement in model efficiency.
This paper addresses the computational expense and memory intensity of deep neural machine translation (NMT) models by proposing a group-permutation based knowledge distillation approach. This method compresses a deep Transformer model into a model that is 8X shallower with almost no loss in BLEU score. Additionally, a Skipping Sub-Layer method is introduced to enhance the teacher model, which achieves a BLEU score of 30.63 on English-German newstest2014.
Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation based knowledge distillation approach to compressing the deep Transformer model into a shallow model. The experimental results on several benchmarks validate the effectiveness of our method. Our compressed model is 8X shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training, which achieves a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at https://github.com/libeineu/GPKD.
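The idea of randomly omitting sub-layers during training can be illustrated with a minimal sketch. It is not taken from the released code. The function name `skip_sublayer`, the parameter `p_skip`, the plain-list vector representation, and the choice to always apply the sub-layer at inference are all assumptions made for illustration; the abstract does not specify the skip probability or how inference is handled.

```python
import random


def skip_sublayer(x, sublayer, p_skip, training, rng=random):
    """Residual sub-layer that may be randomly omitted during training.

    x        : input vector as a list of floats (toy stand-in for a tensor)
    sublayer : callable mapping a vector to a vector of the same length
    p_skip   : hypothetical probability of omitting the sub-layer
    training : the sub-layer is only ever omitted when this is True
    """
    if training and rng.random() < p_skip:
        # Sub-layer omitted: only the residual (identity) path remains.
        return list(x)
    # Usual residual connection: x + sublayer(x).
    return [xi + yi for xi, yi in zip(x, sublayer(x))]


def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sub-layer.
    return [2.0 * xi for xi in x]
```

With `p_skip=0.0` the function reduces to an ordinary residual sub-layer, and with `training=False` the sub-layer is always applied, so the perturbation only affects training in this sketch.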