CL
Jun 23, 2023

Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval

DeepMind
arXiv:2306.13421v2
32 citations
h-index: 59
Originality: Highly original
AI Analysis

This work addresses the challenge of modeling long texts in natural language processing applications. It is incremental in that it builds on existing retrieval-augmented language models.

The authors address the limited adaptation between retrieval-augmented language models and their retrievers by proposing the Retrieval-Pretrained Transformer (RPT), which trains both components jointly from scratch for long-range language modeling. They report improved retrieval quality and perplexity across books, code, and mathematical writing tasks.

Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch and apply it to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
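The retrieval step described in the abstract (split a long document into chunks, then use the most recent chunk to look up earlier chunks of the same document) can be illustrated with a minimal sketch. This is not the RPT implementation: the function names, the fixed chunk size, and the bag-of-words similarity are all assumptions made for illustration. In RPT the query and key representations are computed by the LM itself and learned during training.

```python
from collections import Counter
import math


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks (the last one may be shorter)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def toy_embed(chunk):
    """Hypothetical stand-in for learned representations: L2-normalised bag of words."""
    counts = Counter(chunk)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def similarity(a, b):
    """Dot product between two sparse vectors stored as dicts."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def retrieve_earlier_chunks(chunks, query_index, top_k=2):
    """Return indices of the top_k chunks located before query_index, best first.

    Only earlier chunks are candidates, mirroring self-retrieval from the
    document's own past. Ties are broken in favour of the earlier chunk.
    """
    query = toy_embed(chunks[query_index])
    scored = [
        (similarity(query, toy_embed(chunks[j])), j)
        for j in range(query_index)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:top_k]]


if __name__ == "__main__":
    text = ("the cat sat on the mat . dogs bark loudly at night "
            ". the cat sat again today")
    chunks = split_into_chunks(text.split(), chunk_size=6)
    # The third chunk shares the most words with the first chunk.
    print(retrieve_earlier_chunks(chunks, query_index=2, top_k=1))
```

The sketch covers only the lookup. Per the abstract, RPT additionally fuses the retrieved chunks into the LM representations and trains the retriever to prefer chunks that raise the probability of the next chunk under a reference LM. Neither step is modeled here.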

Code Implementations: 1 repo