CL · LG · Jul 22, 2021

FNetAR: Mixing Tokens with Autoregressive Fourier Transforms

arXiv:2107.10932v1 · 4 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient language models by providing evidence that deeply stacked attention layers may be superfluous, which could benefit researchers and practitioners working on NLP and time-series prediction.

The paper tackles the problem of reducing complexity in Transformer models by replacing half of the self-attention layers with autoregressive Fourier transforms. On Wikitext-103 causal language modeling, FNetAR reaches 25.8 perplexity, close to the 24.2 perplexity of a Transformer-XL baseline, while using only half the number of self-attention layers.

In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse-uniform sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
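
To make the idea of autoregressive Fourier token mixing concrete, here is a minimal sketch in plain Python. It is not taken from the paper: it assumes that causality is enforced by masking the sequence-dimension DFT so that position t only sums over positions 0..t, which is one simple way to make FNet-style mixing causal and may differ from the procedure FNetAR actually uses. The names `dft_matrix` and `fourier_mix` are hypothetical.

```python
import cmath


def dft_matrix(n):
    """Return the n x n discrete Fourier transform matrix as nested lists."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]


def fourier_mix(x, causal=True):
    """Mix tokens with a DFT over the hidden and sequence dimensions.

    x is a list of seq_len rows, each a list of d_model floats.
    With causal=False this is FNet-style mixing: the real part of a 2D DFT.
    With causal=True (an assumption of this sketch, not the paper's stated
    method) position t only sums over positions 0..t, so no future token
    can influence it.
    """
    n, d = len(x), len(x[0])
    f_seq, f_hid = dft_matrix(n), dft_matrix(d)

    # DFT along the hidden dimension; features of one token may mix freely.
    xh = [[sum(f_hid[k][j] * row[j] for j in range(d)) for k in range(d)]
          for row in x]

    out = []
    for t in range(n):
        last = t if causal else n - 1  # causal mask on the sequence axis
        out.append([
            sum(f_seq[t][s] * xh[s][k] for s in range(last + 1)).real
            for k in range(d)
        ])
    return out
```

In this sketch the mixing layer has no learned parameters, which is the source of the parameter reduction the abstract mentions. The explicit matrix form costs O(n^2) per feature and is chosen only for readability.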

Code Implementations: 1 repo
