CL · AI · LG · Apr 19, 2023

Scaling Transformer to 1M tokens and beyond with RMT

arXiv:2304.11062v2 · 115 citations · h-index: 20
Originality: Highly original
AI Analysis

The paper addresses the difficulty transformers have with long sequences. It proposes a novel method for a known bottleneck, aimed at better long-term dependency handling in natural language tasks and at large-scale context processing for memory-intensive applications.

The paper tackles the quadratic scaling of transformer compute with input size. It augments pre-trained transformers with recurrent memory so that compute grows linearly with context length, reports storing information across sequences of up to two million tokens with high retrieval accuracy, and shows perplexity improving on language modeling tasks as more input segments are processed.

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments with language modeling tasks show perplexity improvement as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications.
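
The idea in the abstract, processing the input segment by segment while carrying a memory forward, can be illustrated with a small sketch. This is not the authors' implementation: `toy_step` is a hypothetical stand-in for the transformer, the memory is a plain list of strings rather than learned vectors, and the cost functions count token-pair interactions under the assumption that each segment attends only to itself plus `memory_len` memory slots.

```python
from typing import Callable, List, Sequence

Memory = List[str]


def split_into_segments(tokens: Sequence[str], segment_len: int) -> List[Sequence[str]]:
    """Cut the input into consecutive fixed-size segments (the last may be shorter)."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def run_recurrent(
    tokens: Sequence[str],
    segment_len: int,
    memory: Memory,
    step: Callable[[Memory, Sequence[str]], Memory],
) -> Memory:
    """Process segments in order; each step sees only the current segment and the memory."""
    for segment in split_into_segments(tokens, segment_len):
        memory = step(memory, segment)
    return memory


def toy_step(memory: Memory, segment: Sequence[str]) -> Memory:
    """Hypothetical stand-in for the model: keep tokens tagged 'fact:', drop the rest."""
    return memory + [t for t in segment if t.startswith("fact:")]


def full_attention_pairs(n_tokens: int) -> int:
    """Token-pair interactions for one attention pass over the whole input."""
    return n_tokens * n_tokens


def segmented_attention_pairs(n_tokens: int, segment_len: int, memory_len: int) -> int:
    """Token-pair interactions when each segment attends to itself plus the memory slots."""
    n_segments = -(-n_tokens // segment_len)  # ceiling division
    window = segment_len + memory_len
    return n_segments * window * window


if __name__ == "__main__":
    # One fact at the start, followed by noise spread over many segments.
    tokens = ["fact:key=42"] + ["noise"] * 999
    print(run_recurrent(tokens, segment_len=100, memory=[], step=toy_step))
    for n in (1_000, 2_000, 4_000):
        print(n, full_attention_pairs(n), segmented_attention_pairs(n, 100, 10))
```

In this cost model, doubling the input doubles the segmented count but quadruples the full-attention count, which is the linear versus quadratic contrast the abstract describes. The fact planted in the first segment survives to the end only because the step function passes it along in memory, since no later segment sees the original tokens.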

Code Implementations: 3 repos