FNetAR: Mixing Tokens with Autoregressive Fourier Transforms
This work addresses the need for more efficient language models by demonstrating that heavily compounded attention mechanisms may be superfluous, potentially benefiting researchers and practitioners in NLP and time-series prediction.
The paper tackles the problem of reducing complexity in Transformer models by replacing self-attention layers with autoregressive Fourier transforms. FNetAR retains state-of-the-art performance on Wikitext-103, reaching 25.8 perplexity with only half the number of self-attention layers, compared to 24.2 perplexity for a Transformer-XL baseline.
In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse-uniform sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
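As a rough illustration of what a causal (autoregressive) Fourier mixing layer could look like, the sketch below applies a DFT along the hidden dimension, as in FNet, and then mixes along the sequence with a DFT matrix whose upper triangle is zeroed out, so that position t only receives contributions from positions 0..t. The function name `causal_fourier_mix`, the lower-triangular masking, and the use of NumPy are assumptions made for illustration; this is not necessarily the exact procedure used in FNetAR.

```python
import numpy as np


def causal_fourier_mix(x):
    """Parameter-free causal token mixing (illustrative sketch only).

    x: real array of shape (seq_len, d_model).
    Returns a real array of the same shape.
    """
    n, _ = x.shape

    # DFT over the hidden dimension. This acts on each token separately,
    # so it cannot leak information from future positions.
    h = np.fft.fft(x, axis=-1)

    # Dense DFT matrix over the sequence dimension:
    # F[t, s] = exp(-2*pi*i * t * s / n).
    idx = np.arange(n)
    dft = np.exp(-2j * np.pi * np.outer(idx, idx) / n)

    # Assumption: enforce causality by zeroing entries with s > t,
    # i.e. keeping only the lower triangle of the DFT matrix.
    dft_causal = np.tril(dft)

    # Keep only the real part, as FNet does after its Fourier transform.
    return np.real(dft_causal @ h)


# Hypothetical usage: 6 tokens with 4 hidden features each.
x = np.random.default_rng(0).normal(size=(6, 4))
y = causal_fourier_mix(x)
```

The dense matrix product costs O(n^2) in the sequence length and is used here only to make the causal mask explicit; the layer itself has no learned parameters, which is the source of the parameter reduction discussed above.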