Transformers are Universal Predictors
This work contributes to the theoretical understanding of Transformers and is aimed at researchers in machine learning and natural language processing. It appears incremental, however, because it builds on existing analyses of the architecture.
The paper investigates the limits and universal prediction capabilities of the Transformer architecture for language modeling, analyzing its performance in non-asymptotic data regimes and validating findings with experiments on synthetic and real datasets.
We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
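For readers unfamiliar with the term, the following is a sketch of how "universal prediction in an information-theoretic sense" is commonly formalized. The notation ($\mathcal{P}$, $p$, $q$, $x^n$) is assumed here for illustration and is not taken from the paper, whose exact definition may differ.

```latex
% Assumed notation (not from the paper):
%   x^n = (x_1, \dots, x_n)   a sequence of n symbols
%   p \in \mathcal{P}         the true source, drawn from a class of sources
%   q                         a sequential predictor, q(x^n) = \prod_{t=1}^{n} q(x_t \mid x^{t-1})
%
% q is called universal for the class \mathcal{P} if its per-symbol
% expected excess log-loss (redundancy) vanishes for every source in the class:
\[
  \lim_{n \to \infty} \frac{1}{n}\,
  \mathbb{E}_{x^n \sim p}\!\left[ \log \frac{p(x^n)}{q(x^n)} \right]
  \;=\;
  \lim_{n \to \infty} \frac{1}{n}\, D\!\left(p^n \,\middle\|\, q^n\right)
  \;=\; 0
  \qquad \text{for all } p \in \mathcal{P}.
\]
```

Under this reading, a universal predictor asymptotically matches the log-loss of a predictor that knows the true source. The rate at which the redundancy shrinks for finite $n$ is the kind of quantity a non-asymptotic analysis would concern.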