LG, CL · Mar 29, 2025

TRA: Better Length Generalisation with Threshold Relative Attention

Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao

arXiv:2503.23174v4 · 1 citation · h-index: 53 · Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

The paper addresses a critical limitation of Transformers on tasks that require handling longer sequences, though the advance is incremental because it builds on existing attention mechanisms.

The paper tackled Transformers' poor length generalization by addressing two self-attention failures: inability to remove irrelevant information and positional biases that up-weight irrelevant keys out-of-distribution. It introduced a refactored attention mechanism with selective sparsity and contextualized relative distance, showing substantial improvements in generalization for decoder-only Transformers.
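The first failure, that softmax attention cannot fully remove irrelevant information, can be illustrated with a small sketch. The scores (5.0 for one relevant key, -5.0 for each irrelevant key) and the helper name `relevant_mass` are assumptions chosen for illustration, not values from the paper. Because softmax gives every finite score a strictly positive weight, the mass on the single relevant key shrinks as the number of irrelevant keys grows.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_mass(n_irrelevant, relevant_score=5.0, irrelevant_score=-5.0):
    """Attention weight on one relevant key among n_irrelevant irrelevant keys.

    Both scores are hypothetical values picked for the illustration.
    """
    scores = [relevant_score] + [irrelevant_score] * n_irrelevant
    return softmax(scores)[0]

# The relevant key's share of attention decays as the sequence gets longer,
# even though every irrelevant key has a strongly negative score.
for n in (10, 1_000, 100_000):
    print(n, relevant_mass(n))
```

A model trained only on short sequences never sees this dilution, which is one way the failure could surface only at longer lengths.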

Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position: even if the dot product between a key and query is highly negative (i.e. an irrelevant key), learned positional biases may unintentionally up-weight such information, which is dangerous when distances become out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of a) selective sparsity, which completely removes irrelevant keys from the attention softmax, and b) contextualised relative distance, where distance is only considered between the query and the keys that matter. We show how refactoring the attention mechanism with these two mitigations in place can substantially improve the generalisation capabilities of decoder-only transformers.
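The two mitigations named in the abstract can be sketched for a single query as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hard `threshold`, the linear `slope` penalty (ALiBi-like), and the function name `tra_weights` are all hypothetical choices. Keys scoring below the threshold are dropped before the softmax (selective sparsity), and a surviving key's distance is counted only in surviving keys between it and the query (contextualised relative distance).

```python
import math

def tra_weights(scores, threshold=0.0, slope=0.5):
    """Attention weights for one query over its causal keys (oldest first).

    scores    : raw query-key dot products, one per key
    threshold : keys scoring below this are removed entirely (assumed value)
    slope     : penalty per unit of contextual distance (assumed, ALiBi-like)
    """
    # Selective sparsity: keep only the indices of keys that pass the threshold.
    kept = [j for j, s in enumerate(scores) if s >= threshold]
    weights = [0.0] * len(scores)
    if not kept:
        return weights  # no relevant key: the query attends to nothing

    # Contextualised distance: rank = number of kept keys closer to the query.
    # Dropped keys contribute nothing to the distance.
    logits = [(j, scores[j] - slope * rank)
              for rank, j in enumerate(reversed(kept))]

    # Softmax over the surviving keys only.
    m = max(l for _, l in logits)
    exps = [(j, math.exp(l - m)) for j, l in logits]
    total = sum(e for _, e in exps)
    for j, e in exps:
        weights[j] = e / total
    return weights

# Padding the sequence with irrelevant keys leaves the relevant weights unchanged.
short = tra_weights([3.0, -4.0, 2.0])
long = tra_weights([3.0] + [-4.0] * 500 + [2.0])
print(short[0], long[0])    # weight on the older relevant key
print(short[2], long[501])  # weight on the most recent relevant key
```

In this sketch the weights depend only on the relevant keys and their order, so absolute sequence length never enters the computation. That is the property the abstract argues should help generalisation to out-of-distribution distances.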
