LG · CL · ML · Jul 7, 2020

Do Transformers Need Deep Long-Range Memory

arXiv:2007.03356v1 · 1014 citations
AI Analysis

This work addresses the memory cost of long-range attention in Transformer language models, showing that a long-range memory is not needed at every layer of the network.

The paper investigates whether Transformers need a deep long-range memory at every layer. It finds that comparable performance can be achieved with 6 times fewer long-range memories, and that performance improves when the attention range is limited in the lower layers.

Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which makes its state thousands of times larger than that of its RNN predecessors. However, it is unclear whether this is necessary. We perform a set of interventions to show that comparable performance can be obtained with 6X fewer long-range memories, and that better performance can be obtained by limiting the range of attention in lower layers of the network.
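The two interventions named in the abstract can be illustrated with a minimal sketch. It assumes a 24-layer model in which only the top 4 layers keep a long memory (one way to get 6X fewer long-range memories) and a hard cap on how far back a query may attend. The function names, the layer placement, and the numeric values are hypothetical choices for illustration, not details taken from the paper or its code.

```python
def layer_memory_lengths(num_layers, long_len, short_len, num_long_layers):
    """Assign a memory length to each layer, lowest layer first.

    Assumption for this sketch: the long memories go to the top layers,
    and every other layer keeps only a short memory.
    """
    if not 0 <= num_long_layers <= num_layers:
        raise ValueError("num_long_layers must be between 0 and num_layers")
    num_short_layers = num_layers - num_long_layers
    return [short_len] * num_short_layers + [long_len] * num_long_layers


def causal_range_mask(query_len, mem_len, max_range):
    """Build a mask where mask[i][j] is True if query i may attend to key j.

    Keys are the mem_len cached positions followed by the query_len current
    positions. Query i sits at absolute position mem_len + i and may attend
    to at most max_range positions, itself included, never to the future.
    """
    total_keys = mem_len + query_len
    mask = []
    for i in range(query_len):
        q_pos = mem_len + i
        row = [q_pos - max_range < k_pos <= q_pos for k_pos in range(total_keys)]
        mask.append(row)
    return mask


# Illustrative values only: 24 layers, 4 of them with a long memory.
lengths = layer_memory_lengths(
    num_layers=24, long_len=2304, short_len=128, num_long_layers=4
)
print(lengths.count(2304))  # 4 long-range layers out of 24, i.e. 6X fewer

# A short attention range for a lower layer: each query sees 2 positions.
mask = causal_range_mask(query_len=2, mem_len=3, max_range=2)
for row in mask:
    print(row)
```

In a real model the mask would be applied to the attention logits before the softmax, and each layer would cache only as many past activations as its entry in `lengths` allows.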

Code Implementations: 1 repo