CL CV LG, Aug 17, 2022

Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems

arXiv:2208.08191v30.31 citations, h-index: 66

Originality: Incremental advance

AI Analysis

This work provides theoretical insight into why MLP-based architectures underperform attention mechanisms in NLP and vision tasks, which is important for researchers designing efficient architectures.

The paper analyzes the expressive power of MLP-based architectures compared to attention mechanisms, showing an exponential gap in modeling dependencies between multiple inputs simultaneously. The results provide a theoretical explanation for MLP's inability to compete with attention-based mechanisms in NLP problems and suggest this gap may also explain performance differences in vision tasks.

Vision Transformers are widely used across vision tasks. Meanwhile, another line of work, starting with the MLP-Mixer, tries to achieve similar performance using MLP-based architectures. Interestingly, these MLP-based architectures have not yet been adapted for NLP tasks, and they have so far failed to achieve state-of-the-art performance in vision tasks. In this paper, we analyze the expressive power of MLP-based architectures in modeling dependencies between multiple different inputs simultaneously, and show an exponential gap between attention and MLP-based mechanisms. Our results suggest a theoretical explanation for the inability of MLPs to compete with attention-based mechanisms in NLP problems. They also suggest that the performance gap in vision tasks may be due to the relative weakness of MLPs in modeling dependencies between multiple different locations, and that combining smart input permutations with MLP architectures may not be enough on its own to close the performance gap.
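
To make the contrast in the abstract concrete, the following is a minimal sketch of the two ways of mixing information across positions. It is not the paper's construction or proof, and it does not demonstrate the exponential gap. It only illustrates that attention computes its mixing weights from the input, while an MLP-style token-mixing layer applies fixed weights across positions. The simplifications are assumptions made for illustration: identity query/key/value projections, a single head, and a purely linear token-mixing matrix (`W_TOKENS`, a hypothetical toy value) in place of the full token-mixing MLP.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x):
    """Scaled dot-product attention weights for tokens x (rows = positions).

    Assumption: query and key projections are the identity, so the scores
    are plain dot products between token vectors.
    """
    d = len(x[0])
    scores = matmul(x, transpose(x))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def attention_mix(x):
    """Mix tokens using weights that are computed from x itself."""
    return matmul(attention_weights(x), x)


def token_mix(x, w_tokens):
    """Mix tokens using a fixed matrix over positions (independent of x).

    Assumption: a single linear map stands in for the token-mixing MLP.
    """
    return matmul(w_tokens, x)


# Hypothetical toy values: 3 positions, 2 channels.
W_TOKENS = [[0.5, 0.5, 0.0],
            [0.0, 0.5, 0.5],
            [0.5, 0.0, 0.5]]
x_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_b = [[2.0, 0.0], [0.0, -1.0], [0.5, 0.5]]

weights_a = attention_weights(x_a)  # depends on x_a
weights_b = attention_weights(x_b)  # depends on x_b, so differs from weights_a
mixed_a = token_mix(x_a, W_TOKENS)  # same W_TOKENS is used for any input
```

In this sketch the position-to-position weights used by `attention_mix` change whenever the input changes, whereas `token_mix` always applies the same `W_TOKENS`. That difference is the mechanism the abstract refers to when it discusses modeling dependencies between multiple inputs or locations.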
