CL · Sep 19, 2021

Do Long-Range Language Models Actually Use Long-Range Context?

arXiv:2109.09115v1 · 687 citations
Originality: Synthesis-oriented
AI Analysis

This work examines how useful long-range context actually is to language models, a question of practical interest to NLP researchers and practitioners. The limitations it reveals are an incremental addition to existing efficiency-focused efforts.

The paper investigates whether long-range Transformer language models make effective use of context beyond the previous 2K tokens. It finds that the extra context improves predictions only for a small subset of tokens (e.g., those that can be copied from the distant context) and does not help on sentence-level prediction tasks. The benefit also varies by document type and is largest for literary novels.

Language models are generally trained on short, truncated input sequences, which limits their ability to use discourse-level information present in long-range context to improve their predictions. Recent efforts to improve the efficiency of self-attention have led to a proliferation of long-range Transformer language models, which can process much longer sequences than models of the past. However, the ways in which such models take advantage of the long-range context remain unclear. In this paper, we perform a fine-grained analysis of two long-range Transformer language models (including the Routing Transformer, which achieves state-of-the-art perplexity on the PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to 8K tokens. Our results reveal that providing long-range context (i.e., beyond the previous 2K tokens) to these models only improves their predictions on a small set of tokens (e.g., those that can be copied from the distant context) and does not help at all for sentence-level prediction tasks. Finally, we discover that PG-19 contains a variety of different document types and domains, and that long-range context helps most for literary novels (as opposed to textbooks or magazines).
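The kind of comparison the abstract describes (scoring the same target tokens with a shorter and a longer prefix, then asking which tokens actually benefit) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `perplexity` and `compare_contexts` are hypothetical, the per-token negative log-likelihoods are assumed to have been computed elsewhere by running a model twice, and the numbers at the bottom are made-up toy values rather than results from the paper.

```python
import math


def perplexity(nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(nlls) / len(nlls))


def compare_contexts(nll_short, nll_long, min_gain=0.0):
    """Compare per-token NLLs of the same target tokens under two prefix lengths.

    nll_short: NLLs when the model sees only the truncated (e.g. 2K) prefix.
    nll_long:  NLLs when the model sees the longer (e.g. 8K) prefix.
    min_gain:  a token counts as "helped" if its NLL drops by more than this.
    """
    if len(nll_short) != len(nll_long):
        raise ValueError("both runs must score the same target tokens")
    gains = [s - l for s, l in zip(nll_short, nll_long)]
    helped = [i for i, g in enumerate(gains) if g > min_gain]
    return {
        "ppl_short": perplexity(nll_short),
        "ppl_long": perplexity(nll_long),
        "helped_indices": helped,
        "helped_fraction": len(helped) / len(gains),
    }


# Toy, made-up values: only the last token benefits from the longer prefix,
# as might happen when that token can be copied from distant context.
nll_2k = [2.0, 3.0, 1.0, 4.0]
nll_8k = [2.0, 3.0, 1.0, 1.0]
report = compare_contexts(nll_2k, nll_8k)
print(report)
```

Reporting the fraction of helped tokens alongside aggregate perplexity is a deliberate choice in this sketch: an average can drop even when the improvement is concentrated in a handful of tokens, which is the pattern the abstract reports.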

