CL, Jun 18, 2020

Multi-branch Attentive Transformer

arXiv:2006.10270v2, 19 citations, Has Code
Originality: Incremental advance
AI Analysis

This work addresses a gap in NLP by adapting a successful computer vision technique to improve Transformer models, offering incremental but practical enhancements for researchers and practitioners in sequence learning.

The paper addresses the under-explored use of multi-branch architectures in NLP sequence learning. It proposes the multi-branch attentive Transformer (MAT), whose attention layer averages multiple independent multi-head attention branches, and trains it with two regularization techniques: drop-branch and proximal initialization. The authors report significant improvements on machine translation, code generation, and natural language understanding tasks.

While the multi-branch architecture is one of the key ingredients to the success of computer vision tasks, it has not been well investigated in natural language processing, especially sequence learning tasks. In this work, we propose a simple yet effective variant of Transformer called multi-branch attentive Transformer (briefly, MAT), where the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two training techniques to regularize the training: drop-branch, which randomly drops individual branches during training, and proximal initialization, which uses a pre-trained Transformer model to initialize multiple branches. Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements. Our code is available at https://github.com/HA-Transformer.
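The three ideas in the abstract (branch averaging, drop-branch, and proximal initialization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: all function and parameter names below are hypothetical, the attention is unmasked single-sequence self-attention without biases, and rescaling the surviving branches by 1/(1 - drop_rate) is an assumption made here by analogy with inverted dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_branch(rng, d_model):
    # One branch holds the query, key, value and output projections
    # of a single multi-head attention layer (hypothetical layout).
    names = ("wq", "wk", "wv", "wo")
    return {n: rng.normal(0.0, d_model ** -0.5, (d_model, d_model)) for n in names}

def multi_head_attention(x, p, n_heads):
    # x: (seq_len, d_model). Plain scaled dot-product self-attention.
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    ctx = softmax(scores) @ v
    merged = ctx.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["wo"]

def multi_branch_attention(x, branches, n_heads, drop_rate=0.0,
                           training=False, rng=None):
    # Average the outputs of independent multi-head attention branches.
    outs = np.stack([multi_head_attention(x, p, n_heads) for p in branches])
    if training and drop_rate > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        # Drop-branch: each branch is zeroed independently with
        # probability drop_rate; survivors are rescaled (assumption).
        keep = rng.random(len(branches)) >= drop_rate
        outs = outs * keep[:, None, None] / (1.0 - drop_rate)
    return outs.mean(axis=0)

def proximal_init(pretrained, n_branches):
    # Proximal initialization: every branch starts as an independent
    # copy of the pre-trained single-branch attention parameters.
    return [{k: v.copy() for k, v in pretrained.items()}
            for _ in range(n_branches)]
```

One consequence of this setup is that, immediately after proximal initialization and with drop-branch disabled, the multi-branch layer reproduces the pre-trained layer's output exactly, since it averages identical copies. The branches only diverge once training (with drop-branch) updates them separately.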

Code Implementations: 1 repo