On the Limitations and Capabilities of Position Embeddings for Length Generalization
This work addresses length generalization in Transformers, offering theoretical insights and practical methods of interest to both researchers and practitioners, though it is incremental in that it builds on existing analyses of position embeddings.
The paper investigates how Position Embeddings (PEs) affect Length Generalization (LG) in Transformers. It shows theoretically that PEs structure learned computations across positions rather than expand computational capabilities, and it empirically supports a conjecture that LG is possible if and only if Sequential Representation Complexity (SRC) remains invariant across scales. It introduces Scale Hint and a Learning-Based Position Embedding framework to improve LG in reasoning tasks.
In Transformers, Position Embeddings (PEs) significantly influence Length Generalization (LG) performance, yet their fundamental role remains unclear. In this work, we investigate the limitations and capabilities of PEs in achieving LG. We theoretically analyze PEs in Position-Only Linear Attentions (POLAs), introducing Linear Representation Complexity (LRC) to characterize when PEs enable LG. Our analysis shows that PEs do not expand computational capabilities but structure learned computations across positions. Extending to practical Transformers, we propose Sequential Representation Complexity (SRC) and conjecture that LG is possible if and only if SRC remains invariant across scales. We support this hypothesis with empirical evidence in various reasoning tasks. To enhance LG, we introduce Scale Hint, allowing flexible instance scaling, and a Learning-Based Position Embedding framework that automatically learns positional relations. Our work provides theoretical insights and practical strategies for improving LG in Transformers.
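As a rough illustration of the claim that PEs structure computations across positions, the toy sketch below mixes inputs with weights that depend only on the relative offset i - j. It is not the paper's definition of POLA or LRC: the function name `relative_position_mix` and the weight vector `w` are hypothetical, chosen only to show how a position-only linear rule fitted at one length can be applied unchanged at a longer one.

```python
import numpy as np

def relative_position_mix(x, w):
    """Causal position-only linear mixing (illustrative assumption, not the paper's POLA).

    Computes y[i] = sum over j <= i of w[i - j] * x[j].
    Offsets larger than len(w) - 1 receive weight 0.
    """
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        for j in range(i + 1):
            d = i - j  # relative offset; the weight depends on nothing else
            if d < len(w):
                y[i] += w[d] * x[j]
    return y

# Hypothetical "learned" weights: a first-difference rule, x[i] - x[i-1].
w = np.array([1.0, -1.0])

short = relative_position_mix(np.array([1.0, 4.0, 9.0]), w)
long = relative_position_mix(np.array([1.0, 4.0, 9.0, 16.0, 25.0]), w)
print(short)  # first differences of the length-3 input
print(long)   # same rule, same weights, length-5 input
```

Because the weights are indexed by offset rather than by absolute position, the number of parameters needed to describe the rule does not grow with sequence length, which is the intuition behind asking for a complexity measure that stays invariant across scales.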