LG CL NE | Aug 15, 2023

Attention Is Not All You Need Anymore

arXiv:2308.07661v2 | 8 citations

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck for researchers and practitioners who use Transformers in NLP and computer vision: the cost of self-attention. It offers a potential performance gain with lower resource demands, but the advance appears incremental because it modifies the core architecture rather than replacing it.

The paper tackles the computational and memory inefficiency of the self-attention mechanism in Transformers by proposing a family of drop-in replacements called Extractors. The super high-performance Extractor (SHE) improves on the performance of self-attention, while its simplified versions achieve similar or better results with lower computational and memory complexity.

In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the self-attention mechanism in the Transformer by trading off performance. However, performance is key for the continuing success of the Transformer. In this paper, a family of drop-in replacements for the self-attention mechanism in the Transformer, called the Extractors, is proposed. Four types of the Extractors, namely the super high-performance Extractor (SHE), the higher-performance Extractor (HE), the worthwhile Extractor (WE), and the minimalist Extractor (ME), are proposed as examples. Experimental results show that replacing the self-attention mechanism with the SHE evidently improves the performance of the Transformer, whereas the simplified versions of the SHE, i.e., the HE, the WE, and the ME, perform close to or better than the self-attention mechanism with less computational and memory complexity. Furthermore, the proposed Extractors have the potential or are able to run faster than the self-attention mechanism since their critical paths of computation are much shorter. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.
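For context on the cost the abstract refers to, the following is a minimal sketch of standard single-head scaled dot-product self-attention in plain Python. It is not the Extractor, whose definition is not given in this summary. It only illustrates the baseline: for a sequence of n tokens, self-attention builds an n x n weight matrix, which is the source of the quadratic compute and memory cost. The helper names (`softmax`, `matmul`, `identity`, `self_attention`) and the toy input are assumptions made for illustration.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def identity(d):
    # Hypothetical stand-in for learned projection weights.
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head, unmasked scaled dot-product self-attention.

    x is an n x d list of token vectors; w_q, w_k, w_v are d x d_k matrices.
    Returns (output, weights), where weights is the n x n attention matrix.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)  # n x n: every token scored against every token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: 3 tokens with 2 features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x, identity(2), identity(2), identity(2))
print(len(weights), "x", len(weights[0]))
```

With 3 tokens the weight matrix is 3 x 3; doubling the sequence length quadruples its size, which is the scaling that drop-in replacements such as the Extractors are meant to avoid or justify.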
