CL · LG · Aug 23, 2021

Recurrent multiple shared layers in Depth for Neural Machine Translation

arXiv:2108.10417v2 · 3 citations
Originality: Incremental advance
AI Analysis

This paper addresses parameter efficiency in deep models for machine translation and offers an incremental improvement over existing methods.

The paper tackles the problem of training deeper neural machine translation models without a large increase in parameters. It proposes a recurrent mechanism that loops the encoder and decoder blocks in the depth direction and shares parameters across loops. The resulting model outperforms the shallow Transformer baselines by 0.35 to 1.45 BLEU points and performs similarly to a deep Transformer while using 54.72% of that model's parameters.

Learning deeper models is usually a simple and effective way to improve model performance, but deeper models have more parameters and are more difficult to train. To get a deeper model, simply stacking more layers seems as if it should work well, but previous work has claimed that doing so does not benefit the model. We propose to train a deeper model with a recurrent mechanism, which loops the encoder and decoder blocks of the Transformer in the depth direction. To limit the growth in model parameters, we share parameters across the different recursive moments. We conduct experiments on the WMT16 English-to-German and WMT14 English-to-French translation tasks. Our model outperforms the shallow Transformer-Base and Transformer-Big baselines by 0.35 and 1.45 BLEU points, respectively, while using 27.23% of the Transformer-Big model's parameters. Compared to the deep Transformer (20-layer encoder, 6-layer decoder), our model has similar performance and inference speed, but only 54.72% of its parameters.
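The idea of looping blocks in the depth direction with shared parameters can be illustrated with a toy sketch. This is not the paper's implementation: `ToyBlock` is a hypothetical stand-in for a Transformer block (a residual linear map on a plain vector), and the block count, loop count, and dimension are arbitrary assumptions chosen only to show that effective depth grows with the number of loops while the number of unique parameters does not.

```python
import random


class ToyBlock:
    """Hypothetical stand-in for a Transformer block: x -> x + W x."""

    def __init__(self, dim, rng):
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(dim)]

    def num_params(self):
        return sum(len(row) for row in self.weight)

    def __call__(self, x):
        # Residual connection around a linear map.
        return [xi + sum(w * xj for w, xj in zip(row, x))
                for xi, row in zip(x, self.weight)]


def stacked_forward(blocks, x):
    """Plain deep model: every layer has its own parameters."""
    for block in blocks:
        x = block(x)
    return x


def recurrent_forward(blocks, x, loops):
    """Recurrence in depth: the same blocks are applied `loops` times."""
    for _ in range(loops):
        for block in blocks:
            x = block(x)
    return x


def unique_params(blocks):
    """Count parameters once per distinct block object."""
    distinct = {id(b): b for b in blocks}
    return sum(b.num_params() for b in distinct.values())


rng = random.Random(0)
dim = 4  # assumed toy dimension
shared = [ToyBlock(dim, rng) for _ in range(2)]  # 2 blocks, reused
deep = [ToyBlock(dim, rng) for _ in range(6)]    # 6 independent blocks

x = [1.0, 0.0, -1.0, 0.5]
y_shared = recurrent_forward(shared, x, loops=3)  # effective depth 6
y_deep = stacked_forward(deep, x)                 # effective depth 6

shared_params = unique_params(shared)  # 2 * 4 * 4 = 32
deep_params = unique_params(deep)      # 6 * 4 * 4 = 96
```

In this toy setting both models apply six blocks per forward pass, but the looped model stores one third of the parameters. The ratios reported in the abstract (27.23% and 54.72%) come from the authors' actual configurations, which this sketch does not attempt to reproduce.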

