LG · CL · SD · AS · ML · May 18, 2021

Relative Positional Encoding for Transformers with Linear Complexity

arXiv:2105.08399v2 · 63 citations
Originality: Incremental advance
AI Analysis

This work removes a bottleneck for researchers and practitioners who use long-sequence Transformers by making relative positional encoding compatible with efficient linear-complexity variants. The advance is incremental because it builds on existing relative positional encoding and linear Transformer methods.

The paper enables relative positional encoding (RPE) in linear-complexity Transformers. RPE previously required an explicit attention matrix, which linear variants deliberately avoid computing. The proposed Stochastic Positional Encoding mimics RPE behavior and achieves competitive performance on the Long-Range Arena benchmark and on music generation.

Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear-variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement to the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
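The core idea in the abstract, that a relative (lag-dependent) position kernel can be obtained as the cross-covariance of correlated Gaussian signals, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `sinusoidal_spe` and `target_kernel`, the sinusoidal parameterization, and the values of `freqs`, `gains`, and `phases` are all assumptions chosen for illustration. The sketch draws random position features `qbar` and `kbar` whose product approximates a kernel that depends only on the lag `m - n`, and then checks that the position term can be applied without ever forming the N×N matrix.

```python
import numpy as np

def sinusoidal_spe(n_pos, freqs, gains, phases, n_real, rng):
    """Draw random position features (hypothetical helper, not the paper's code).

    Returns qbar, kbar of shape (n_pos, n_real) such that
    E[qbar @ kbar.T][m, n] = sum_k gains[k]**2 * cos(2*pi*freqs[k]*(m - n) + phases[k]).
    """
    pos = np.arange(n_pos)[:, None]                # (N, 1)
    angle = 2.0 * np.pi * pos * freqs[None, :]     # (N, K)
    g = gains[None, :]                             # (1, K)
    # Query-side and key-side deterministic bases; only the query side is phase-shifted.
    omega_q = np.concatenate([g * np.cos(angle + phases), g * np.sin(angle + phases)], axis=1)
    omega_k = np.concatenate([g * np.cos(angle), g * np.sin(angle)], axis=1)
    # Shared Gaussian noise: this is what correlates the two random signals.
    z = rng.standard_normal((omega_q.shape[1], n_real)) / np.sqrt(n_real)
    return omega_q @ z, omega_k @ z

def target_kernel(n_pos, freqs, gains, phases):
    """Exact relative kernel P[m, n] = f(m - n) that the random features approximate."""
    lag = np.arange(n_pos)[:, None] - np.arange(n_pos)[None, :]   # (N, N)
    return np.sum(gains**2 * np.cos(2.0 * np.pi * lag[..., None] * freqs + phases), axis=-1)

rng = np.random.default_rng(0)
freqs = np.array([0.05, 0.13, 0.31])    # illustrative values only
gains = np.array([0.8, 0.5, 0.3])
phases = np.array([0.0, 0.4, -1.0])

qbar, kbar = sinusoidal_spe(16, freqs, gains, phases, 20000, rng)
approx = qbar @ kbar.T                   # formed here only to check the approximation
target = target_kernel(16, freqs, gains, phases)
max_err = np.abs(approx - target).max()

# Linear-complexity ordering: contract keys with values first, never build N x N.
values = rng.standard_normal((16, 4))
linear = qbar @ (kbar.T @ values)
quadratic = approx @ values
```

In this sketch the approximation error shrinks as `n_real` grows, and the `linear` ordering costs time proportional to the sequence length rather than its square. How such features are combined with the actual queries and keys of a linear Transformer is left out here.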

Code Implementations: 1 repo
