CL · May 5, 2023

Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

arXiv:2305.03796v1137 citations
Originality: Highly original
AI Analysis

The paper addresses a foundational limitation of Transformer architectures on formal language tasks, with potential applications to sequence modeling and length extrapolation in natural language processing.

The paper tackles the problem of Transformers' inability to perfectly model regular languages by proposing RegularGPT, a variant that enables successful modeling of languages like PARITY and rediscovers local windowed attention for natural language length extrapolation.

Unlike recurrent models, conventional wisdom has it that Transformers cannot perfectly model regular languages. Inspired by the notion of working memory, we propose a new Transformer variant named RegularGPT. With its novel combination of Weight-Sharing, Adaptive-Depth, and Sliding-Dilated-Attention, RegularGPT constructs working memory along the depth dimension, thereby enabling efficient and successful modeling of regular languages such as PARITY. We further test RegularGPT on the task of natural language length extrapolation and surprisingly find that it rediscovers the local windowed attention effect deemed necessary in prior work for length extrapolation.
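To make the PARITY example concrete, here is a minimal sketch in plain Python. It is not the paper's model. It contrasts a sequential two-state automaton with a layer-by-layer reduction that reuses one chunk-level rule at every level, loosely mirroring the idea of shared weights applied over a depth that grows with input length. The function names and the `chunk` parameter are hypothetical and chosen only for illustration.

```python
def parity_dfa(bits):
    """Sequential view: a two-state automaton whose state flips on every 1."""
    state = 0
    for b in bits:
        state ^= b
    return state


def parity_by_depth(bits, chunk=2):
    """Depth-wise view: apply the same chunk-level rule repeatedly until
    one value remains. Returns (parity, number of levels used)."""
    level = list(bits)
    depth = 0
    while len(level) > 1:
        # The same rule is reused at every level (an analogue of weight sharing):
        # each output summarizes one chunk of the previous level.
        level = [sum(level[i:i + chunk]) % 2 for i in range(0, len(level), chunk)]
        depth += 1
    return (level[0] if level else 0), depth


if __name__ == "__main__":
    bits = [1, 0, 1, 1]
    print(parity_dfa(bits))
    print(parity_by_depth(bits, chunk=2))
```

In this sketch the number of levels grows logarithmically with the input length (base `chunk`), whereas the automaton takes one step per symbol. The intermediate values at each level play the role of a small working memory.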

