DARTFormer: Finding The Best Type Of Attention
This work addresses the challenge of selecting the optimal attention mechanism for a Transformer, a question that matters to NLP researchers and practitioners. The contribution is incremental, since it builds on existing NAS and attention methods.
The authors tackle the problem of identifying the most effective attention mechanism for a Transformer on a given task. They propose a DARTS-like Neural Architecture Search method that finds the best attention for IMDb byte-level text classification and Listops. They then extend it to build heterogeneous Transformers with multiple attention types, which perform better than the average homogeneous model but do not outperform the best one.
Given the wide and ever-growing range of efficient Transformer attention mechanisms, it is important to identify which attention is most effective for a given task. In this work, we are also interested in combining different attention types to build heterogeneous Transformers. We first propose a DARTS-like Neural Architecture Search (NAS) method to find the best attention for a given task. In this setup, all heads use the same attention (homogeneous models). Our results suggest that NAS is highly effective on this task, and it identifies the best attention mechanisms for IMDb byte-level text classification and Listops. We then extend our framework to search for and build Transformers with multiple different attention types, which we call heterogeneous Transformers. We show that whilst these heterogeneous Transformers are better than the average homogeneous models, they cannot outperform the best. We explore the reasons why heterogeneous attention makes sense, and why it ultimately fails.
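To make the DARTS-like search concrete, the following is a minimal sketch of the continuous relaxation that DARTS-style methods generally rely on: each candidate attention receives a learnable architecture weight, the layer output is the softmax-weighted sum of all candidate outputs, and the candidate with the largest weight is kept once the search ends. The two candidates (`full_attention`, `local_attention`), the `CANDIDATES` table, and the function names are illustrative assumptions rather than the candidate set or code used in this work, and the gradient-based training of the weights is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Standard scaled dot-product attention over all positions."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        w = softmax([dot(qi, kj) * scale for kj in k])
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v))
                    for d in range(len(v[0]))])
    return out


def local_attention(q, k, v, window=1):
    """Attention restricted to keys within `window` positions of the query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    n = len(q)
    out = []
    for i, qi in enumerate(q):
        idx = range(max(0, i - window), min(n, i + window + 1))
        w = softmax([dot(qi, k[j]) * scale for j in idx])
        out.append([sum(wj * v[j][d] for wj, j in zip(w, idx))
                    for d in range(len(v[0]))])
    return out


# Hypothetical candidate set; the search space in the paper may differ.
CANDIDATES = {"full": full_attention, "local": local_attention}


def mixed_attention(alpha, q, k, v):
    """DARTS-style mixed op: softmax(alpha)-weighted sum of candidate outputs."""
    names = list(CANDIDATES)
    weights = softmax([alpha[name] for name in names])
    outputs = [CANDIDATES[name](q, k, v) for name in names]
    return [[sum(w * o[i][d] for w, o in zip(weights, outputs))
             for d in range(len(v[0]))]
            for i in range(len(q))]


def select_attention(alpha):
    """Discretisation step: keep the candidate with the largest weight."""
    return max(alpha, key=alpha.get)
```

In a full search, `alpha` would be optimised by gradient descent together with the model weights. Under the assumptions of this sketch, a homogeneous model shares one `alpha` across all heads, whereas a heterogeneous model would keep separate architecture weights so that different parts of the model can select different attention types.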