Context-aware Biases for Length Extrapolation
This work addresses the length extrapolation limitation in Transformers, a domain-specific problem for NLP models. The contribution is incremental, as it builds on existing Relative Positional Encoding (RPE) methods.
The paper tackles the problem of Transformers struggling to generalize to sequences longer than those seen in training. It proposes Context-Aware Biases for Length Extrapolation (CABLE), which learns token-specific, context-aware biases for each attention head. The reported results are lower perplexity for GPT-2 Medium on longer sequences and improved performance in long-context retrieval tasks with BERT base.
Transformers often struggle to generalize to longer sequences than those seen during training, a limitation known as length extrapolation. Most existing Relative Positional Encoding (RPE) methods attempt to address this by introducing either fixed linear biases or globally learned biases, which lack the capacity to adapt to different input contexts. In this work, we propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE), a method that learns token-specific, context-aware biases for each attention head in Transformers. By dynamically adjusting positional biases based on the input sequence, CABLE overcomes the rigidity of fixed RPEs. When evaluated on sequences longer than those seen during training, GPT-2 Medium (334M parameters) with CABLE achieves lower perplexity than counterparts using other widely adopted positional encoding methods. Additionally, applying CABLE to the BERT base model improves performance in long-context retrieval tasks. Our method significantly enhances the extrapolation performance of existing RPE methods when tested on the FineWeb-Edu-10B and WikiText-103 datasets. Our code is available at: https://github.com/AlgonetLabs/Cable.
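To make the contrast between fixed and context-aware additive biases concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give CABLE's exact formula, so the parameterization below (each token emits a positive "step" from its own hidden state, and the bias between positions i and j is the negative sum of the steps in between) is an assumption chosen for illustration. The names softplus, fixed_linear_bias, context_aware_bias, attention_weights, and head_weights are all hypothetical.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def fixed_linear_bias(seq_len, slope):
    # Fixed linear bias: the penalty grows with distance i - j and is
    # identical for every input sequence. Future positions are masked.
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def context_aware_bias(token_states, head_weights):
    # ASSUMPTION (illustrative only, not taken from the paper):
    # token k emits a positive step g_k = softplus(w . x_k), and
    # b[i][j] = -(g_{j+1} + ... + g_i), so the bias depends on the content
    # of the tokens between j and i, not only on their distance.
    steps = [
        softplus(sum(w * x for w, x in zip(head_weights, state)))
        for state in token_states
    ]
    prefix = [0.0]
    for g in steps:
        prefix.append(prefix[-1] + g)
    n = len(token_states)
    return [
        [-(prefix[i + 1] - prefix[j + 1]) if j <= i else -math.inf
         for j in range(n)]
        for i in range(n)
    ]


def attention_weights(scores, bias):
    # Row-wise softmax over (scores + bias); the bias is purely additive.
    rows = []
    for score_row, bias_row in zip(scores, bias):
        logits = [s + b for s, b in zip(score_row, bias_row)]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Three tokens with 2-dimensional hidden states (made-up values).
states = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
bias = context_aware_bias(states, head_weights=[1.0, -1.0])
weights = attention_weights([[0.0] * 3 for _ in range(3)], bias)
```

Under this assumed parameterization, setting head_weights to all zeros makes every step equal to ln 2, so the bias collapses to a fixed linear bias with slope ln 2. With nonzero weights, two token pairs at the same distance can receive different biases, which is the kind of input-dependent flexibility the abstract attributes to context-aware biases.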