CL · Jun 18, 2020

Multi-branch Attentive Transformer

Yang Fan, Shufang Xie, Yingce Xia, Lijun Wu, Tao Qin, Xiang-Yang Li, Tie-Yan Liu

arXiv:2006.10270v22.319 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a gap in NLP by adapting a successful computer vision technique to improve Transformer models, offering incremental but practical enhancements for researchers and practitioners in sequence learning.

The paper addresses the under-explored use of multi-branch architectures in NLP sequence learning. It proposes the multi-branch attentive Transformer (MAT), which averages multiple independent attention branches and is trained with two regularization techniques, drop-branch and proximal initialization. The authors report significant improvements on machine translation, code generation, and natural language understanding tasks.

While the multi-branch architecture is one of the key ingredients to the success of computer vision tasks, it has not been well investigated in natural language processing, especially sequence learning tasks. In this work, we propose a simple yet effective variant of Transformer called multi-branch attentive Transformer (briefly, MAT), where the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two training techniques to regularize the training: drop-branch, which randomly drops individual branches during training, and proximal initialization, which uses a pre-trained Transformer model to initialize multiple branches. Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements. Our code is available at https://github.com/HA-Transformer.
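The branch averaging and drop-branch step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `multi_branch_average` and `proximal_init` are hypothetical, the branches are stand-in callables rather than real multi-head attention layers, the rescaling of surviving branches by 1/(1 - drop_rate) is an assumption borrowed from standard inverted dropout, and initializing each branch as a copy of the pre-trained parameters is one plausible reading of "proximal initialization".

```python
import copy
import random
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]
# Stand-in for one attention branch: maps a vector to a same-length vector.
Branch = Callable[[Sequence[float]], Vector]


def multi_branch_average(
    branches: Sequence[Branch],
    x: Sequence[float],
    drop_rate: float = 0.0,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Average the outputs of all branches, with optional drop-branch."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = rng or random.Random()
    use_drop = training and drop_rate > 0.0
    # Assumption: kept branches are rescaled as in inverted dropout.
    scale = 1.0 / (1.0 - drop_rate) if use_drop else 1.0
    total = [0.0] * len(x)
    for branch in branches:
        # Drop-branch: skip this branch with probability drop_rate (training only).
        if use_drop and rng.random() < drop_rate:
            continue
        for j, value in enumerate(branch(x)):
            total[j] += scale * value
    # Always divide by the full branch count, not the number of kept branches.
    return [t / len(branches) for t in total]


def proximal_init(pretrained: Dict[str, list], n_branches: int) -> List[Dict[str, list]]:
    """Assumed form of proximal initialization: every branch starts as an
    independent copy of the pre-trained parameters."""
    return [copy.deepcopy(pretrained) for _ in range(n_branches)]


if __name__ == "__main__":
    toy_branches = [lambda v: [2.0 * a for a in v], lambda v: [4.0 * a for a in v]]
    print(multi_branch_average(toy_branches, [1.0, 2.0]))
    print(multi_branch_average(toy_branches, [1.0, 2.0], drop_rate=0.5,
                               training=True, rng=random.Random(0)))
```

At inference time (`training=False`) no branch is dropped and the output is the plain average of all branches, which matches the abstract's description of the attention layer. In a real model each branch would be a multi-head attention layer with its own parameters.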
