Weighted Transformer Network for Machine Translation
This work addresses efficiency and performance issues in machine translation for researchers and practitioners, though it is incremental as it builds on the existing Transformer architecture.
The paper tackles the high computational cost and slow convergence of the Transformer model in machine translation by introducing Weighted Transformer, which replaces multi-head attention with learnable self-attention branches, resulting in a 15-40% faster convergence and improvements of 0.5 and 0.4 BLEU points on English-to-German and English-to-French tasks, respectively.
State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recursion. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose Weighted Transformer, a Transformer with modified attention layers, that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention by multiple self-attention branches that the model learns to combine during the training process. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 on the English-to-French translation task.