LG
Sep 15, 2023

Attention-Only Transformers and Implementing MLPs with Attention Heads

arXiv:2309.08593v1
h-index: 4
Originality: Incremental advance
AI Analysis

This work advances the theoretical understanding of transformer architectures by showing that MLPs are not strictly necessary. It is rated an incremental advance because it builds on existing transformer theory and offers no immediate practical applications.

The authors prove that, for certain activation functions, an MLP neuron can be implemented by a masked attention head with internal dimension 1. An MLP-and-attention transformer can therefore be converted into an attention-only transformer at the cost of a much larger head count. They also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately, and that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
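The SiLU case can be illustrated with a minimal sketch. It is not the paper's exact construction; it only assumes that the mask lets a token attend to itself and to one reference position whose score and value are both zero. With scores `(x, 0)` the softmax weight on the token itself is `sigmoid(x)`, so a value of `x` gives `x * sigmoid(x)`, which is SiLU. The names `attention_neuron`, `w_in`, and the zero reference position are hypothetical and chosen for illustration.

```python
import math


def silu(x: float) -> float:
    """Reference SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_neuron(h, w_in):
    """One MLP neuron computed as a two-position attention head.

    Assumption (illustrative, not taken from the paper): the mask
    restricts attention to the token itself and a reference position
    that contributes score 0 and value 0.
    """
    # Pre-activation of the neuron: a single scalar, so the head's
    # internal dimension is 1.
    x = sum(w * v for w, v in zip(w_in, h))
    scores = [x, 0.0]   # [self, reference]
    values = [x, 0.0]   # [self, reference]
    weights = softmax(scores)  # weights[0] equals sigmoid(x)
    return sum(a * v for a, v in zip(weights, values))


h = [0.5, -1.2, 2.0]
w_in = [1.0, 0.3, -0.7]
pre_activation = sum(w * v for w, v in zip(w_in, h))
print(attention_neuron(h, w_in), silu(pre_activation))
```

The two printed values should agree up to floating-point error. One head of this kind reproduces one neuron, which is consistent with the abstract's remark that the conversion greatly increases the number of attention heads.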

