LG · CL · ML · Jan 9, 2019

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

arXiv:1901.02860v3 · 4347 citations
Originality: Highly original
AI Analysis

For AI researchers and practitioners, this paper addresses a key bottleneck in language modeling: it lets models capture longer-term dependencies without disrupting temporal coherence.

The paper tackles the fixed-length context limitation of Transformer language models by proposing Transformer-XL, which learns dependencies 80% longer than RNNs and 450% longer than vanilla Transformers and achieves state-of-the-art results such as 0.99 bits per character (bpc) on enwiki8.

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
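
The context fragmentation problem mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `attention_spans` and the parameters `seg_len` and `mem_len` are hypothetical, the view is limited to a single causal attention layer, and the relative positional encoding and the gradient handling of the cached states are ignored. It only computes which positions each token could attend to when a token stream is cut into fixed-length segments, with and without states carried over from the previous segment.

```python
from typing import List


def attention_spans(n_tokens: int, seg_len: int, mem_len: int) -> List[range]:
    """For each token position, return the positions one causal attention
    layer can see when the stream is cut into fixed-length segments.

    mem_len = 0 models a vanilla fixed-length Transformer (nothing carried over).
    mem_len > 0 models reusing cached states from the preceding positions
    (an assumption-level stand-in for segment-level recurrence).
    """
    spans = []
    for pos in range(n_tokens):
        seg_start = (pos // seg_len) * seg_len   # first token of this segment
        ctx_start = max(0, seg_start - mem_len)  # extend left into the cached memory
        spans.append(range(ctx_start, pos + 1))  # causal: up to and including pos
    return spans


vanilla = attention_spans(n_tokens=8, seg_len=4, mem_len=0)
recurrent = attention_spans(n_tokens=8, seg_len=4, mem_len=4)

# Token 4 opens the second segment.
print(len(vanilla[4]))    # context size without memory
print(len(recurrent[4]))  # context size with one cached segment
```

Without memory, the first token of every segment sees only itself, which is the fragmentation the abstract refers to. With a cached segment, the same token also sees the previous segment, and stacking layers would extend the reach further.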

Code Implementations: 37 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.