LG | CL | Oct 18, 2023

From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers

arXiv:2310.11984v3 | 16 citations | h-index: 5
Originality: Highly original
AI Analysis

This work addresses a fundamental limitation of transformer architectures on algorithmic tasks: poor generalization to sequences longer than those seen in training. It has potential applications in more complex domains, though it is incremental in that it builds on existing attention mechanisms.

The paper tackles the failure of transformer models to generalize to longer sequences on arithmetic tasks such as addition and parity. It introduces Attention Bias Calibration (ABC) to enable near-perfect length generalization on certain arithmetic tasks, and its approach solves Parity, a known failure mode for transformers.

In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. In particular, our solution solves the Parity task, a well-known and theoretically proven failure mode for Transformers. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we show to be connected to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks. In addition, we show that ABC bears remarkable similarities to RPE and LoRA, which may indicate the potential for applications to more complex tasks.
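The abstract's idea of "targeted attention biasing" can be illustrated with a small, generic sketch. This is not the paper's ABC procedure: it only shows the common pattern of adding a bias matrix to raw attention scores before the softmax, with a bias that depends on the relative offset between query and key positions so the same table applies at any sequence length. The function names (`relative_bias`, `biased_attention_weights`) and the offset table are hypothetical and chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_bias(n_query, n_key, bias_by_offset, default=-1e9):
    """Build an additive bias matrix B[i][j] that depends only on j - i.

    Offsets missing from bias_by_offset get a large negative value, which
    effectively masks them after the softmax. Because the bias is a function
    of the offset alone, the same table can be reused at longer lengths.
    (Illustrative assumption, not the calibration method from the paper.)
    """
    return [[bias_by_offset.get(j - i, default) for j in range(n_key)]
            for i in range(n_query)]


def biased_attention_weights(scores, bias):
    """Row-wise softmax(scores + bias)."""
    return [softmax([s + b for s, b in zip(s_row, b_row)])
            for s_row, b_row in zip(scores, bias)]


# Hypothetical offset table: attend only to the current and previous position.
offsets = {0: 0.0, -1: 0.0}

# The same table is applied at a short length and at a longer one.
for n in (4, 8):
    scores = [[0.0] * n for _ in range(n)]  # uniform raw scores for clarity
    weights = biased_attention_weights(scores, relative_bias(n, n, offsets))
    print(n, [round(w, 3) for w in weights[-1]])
```

With uniform raw scores, each query row places its attention only on the offsets present in the table, regardless of the sequence length. That length-independent behaviour is what a bias of this kind is meant to provide.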

Code Implementations: 1 repo
