Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling
This paper provides theoretical insights into Transformer mechanisms for researchers in sequence modeling, though the contribution appears incremental because it analyzes existing components.
The paper systematically studies the approximation properties of Transformers for sequence modeling with long, sparse, and complicated memory, revealing the roles of critical parameters like layers and attention heads through theoretical analysis and experimental validation.
We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.
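For readers who want a concrete picture of the components named in the abstract, the following is a minimal pure-Python sketch of a Transformer that combines dot-product self-attention, positional encoding and a feed-forward layer, with the number of layers and the number of attention heads exposed as parameters. It is an illustration under stated assumptions, not the paper's construction: the sinusoidal encoding, causal masking, ReLU activation, random Gaussian weights, the absence of layer normalization, and all function names (`init_layer`, `attention_head`, `layer_forward`, `transformer`) are choices made here for the example.

```python
import math
import random

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    # Elementwise sum of two equally shaped matrices.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def positional_encoding(seq_len, d_model):
    # Assumption: sinusoidal encoding; the paper may analyze other variants.
    def angle(t, i):
        return t / 10000 ** (2 * (i // 2) / d_model)
    return [[math.sin(angle(t, i)) if i % 2 == 0 else math.cos(angle(t, i))
             for i in range(d_model)] for t in range(seq_len)]

def rand_matrix(rng, rows, cols):
    scale = 1 / math.sqrt(rows)
    return [[rng.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def init_layer(rng, d_model, n_heads, d_ff):
    # Hypothetical parameter layout: one (Wq, Wk, Wv) triple per head.
    d_head = d_model // n_heads
    return {
        "heads": [{name: rand_matrix(rng, d_model, d_head)
                   for name in ("Wq", "Wk", "Wv")} for _ in range(n_heads)],
        "Wo": rand_matrix(rng, n_heads * d_head, d_model),
        "W1": rand_matrix(rng, d_model, d_ff),
        "W2": rand_matrix(rng, d_ff, d_model),
    }

def attention_head(x, h):
    q, k, v = matmul(x, h["Wq"]), matmul(x, h["Wk"]), matmul(x, h["Wv"])
    d_head = len(q[0])
    weights = []
    for t, q_t in enumerate(q):
        # Assumption: causal masking, so position t attends to 0..t only.
        scores = [sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d_head)
                  for s in range(t + 1)]
        weights.append(softmax(scores) + [0.0] * (len(x) - t - 1))
    return matmul(weights, v), weights

def layer_forward(x, p):
    # Multi-head attention: run each head, concatenate, project, add residual.
    outs = [attention_head(x, h)[0] for h in p["heads"]]
    concat = [sum((o[t] for o in outs), []) for t in range(len(x))]
    x = add(x, matmul(concat, p["Wo"]))
    # Position-wise feed-forward layer with ReLU, plus residual.
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, p["W1"])]
    return add(x, matmul(hidden, p["W2"]))

def transformer(x, layers):
    # x: seq_len rows of d_model features; layers: list of init_layer outputs.
    x = add(x, positional_encoding(len(x), len(x[0])))
    for p in layers:
        x = layer_forward(x, p)
    return x
```

Calling `transformer(x, [init_layer(rng, 8, 2, 16) for _ in range(3)])` with `rng = random.Random(0)` runs a three-layer, two-head model on a sequence `x` whose rows have 8 features. Varying the length of the `layers` list and the `n_heads` argument changes the two parameters whose roles the paper studies.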