Very Deep Transformers for Neural Machine Translation
This work addresses translation accuracy in machine translation systems; its contribution is incremental, as it builds on the existing Transformer architecture.
The paper tackles the challenge of training very deep Transformer models for Neural Machine Translation by introducing a simple initialization technique that stabilizes training. The resulting models, with up to 60 encoder layers, outperform baseline 6-layer models by up to 2.5 BLEU and achieve new state-of-the-art results, such as 43.8 BLEU on WMT14 English-French.
We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers. These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU, and achieve new state-of-the-art benchmark results on WMT14 English-French (43.8 BLEU and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU). The code and trained models will be publicly available at: https://github.com/namisan/exdeep-nmt.