CL, LG · Nov 10, 2019

Improving Transformer Models by Reordering their Sublayers

arXiv:1911.03864v2 · 1037 citations
Originality: Incremental advance
AI Analysis

This work asks whether transformer language models can be improved through architectural changes alone. The contribution is incremental: it builds on the existing transformer design, and its gains do not generalize to every task.

The authors investigated whether reordering the self-attention and feedforward sublayers of a transformer could improve performance. They found that a 'sandwich' pattern, with more self-attention sublayers at the bottom and more feedforward sublayers at the top, improved (lowered) perplexity on language modeling benchmarks without increasing parameters, memory, or training time.

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. We propose a new transformer pattern that adheres to this property, the sandwich transformer, and show that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time. However, the sandwich reordering pattern does not guarantee performance gains across every task, as we demonstrate on machine translation models. Instead, we suggest that further exploration of task-specific sublayer reorderings is needed in order to unlock additional gains.
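The sublayer orderings described in the abstract can be sketched as strings, with `s` for a self-attention sublayer and `f` for a feedforward sublayer. This is a minimal illustration, not the authors' code. The function names are hypothetical, and the parameterisation of the sandwich pattern by a coefficient `k` (`k` self-attention sublayers, then `n - k` interleaved pairs, then `k` feedforward sublayers) is an assumption that the abstract above does not spell out.

```python
import random


def interleaved(n):
    """Baseline ordering: n (self-attention, feedforward) pairs."""
    return "sf" * n


def sandwich(n, k):
    """Assumed sandwich ordering: k 's' sublayers at the bottom,
    n - k interleaved pairs in the middle, k 'f' sublayers at the top."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return "s" * k + "sf" * (n - k) + "f" * k


def random_ordering(n, seed=None):
    """Random permutation of n 's' and n 'f' sublayers.
    The sublayer counts match the baseline, so the parameter count does too."""
    rng = random.Random(seed)
    layers = list("s" * n + "f" * n)
    rng.shuffle(layers)
    return "".join(layers)


if __name__ == "__main__":
    print(interleaved(4))           # sfsfsfsf
    print(sandwich(4, 2))           # sssfsfff
    print(random_ordering(4, seed=0))
```

Every ordering produced this way contains the same sublayers as the baseline, only in a different sequence, which is consistent with the abstract's claim that reordering adds no parameters. With `k = 0` the sketch reduces to the interleaved baseline.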

Code Implementations: 2 repos