You Need Better Attention Priors

arXiv:2601.15380v1, 1 citation

Originality: Incremental advance

AI Analysis

This work addresses fundamental representational limitations of attention in transformer architectures, though it appears to be an incremental improvement over existing attention mechanisms.

The authors address the limitations of standard attention with GOAT, a generalized attention mechanism based on Entropic Optimal Transport (EOT) that replaces the implicit uniform prior with a learnable continuous prior. The reported benefits are improved length generalization and a remedy for attention sinks, while remaining compatible with optimized kernels such as FlashAttention.

We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport, revealing that standard attention corresponds to a transport problem regularized by an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this naive assumption with a learnable, continuous prior. This prior maintains full compatibility with optimized kernels such as FlashAttention. GOAT also provides an EOT-based explanation of attention sinks and materializes a solution for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
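The link between softmax attention and a uniform prior can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it uses the standard closed form for a row-wise entropic problem, where maximizing ⟨P, S⟩ − ε·KL(P ‖ π) over row-stochastic P gives P_ij ∝ π_ij·exp(S_ij/ε). The function name `attention_with_prior`, the temperature `eps`, and the exponential "recency" prior are hypothetical choices made for this example; GOAT's actual prior parameterization is not described in the abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_prior(scores, prior, eps=1.0):
    """Row-wise maximizer of <P, scores> - eps * KL(P || prior).

    Closed form: P_ij is proportional to prior_ij * exp(scores_ij / eps),
    i.e. a softmax over scores / eps + log(prior).
    (Illustrative helper, not an API from the paper.)
    """
    return softmax(scores / eps + np.log(prior), axis=-1)


rng = np.random.default_rng(0)
n_q, n_k, d = 4, 6, 8
Q = rng.normal(size=(n_q, d))
K = rng.normal(size=(n_k, d))
scores = Q @ K.T / np.sqrt(d)

# Standard attention weights.
standard = softmax(scores)

# Uniform prior: log(1 / n_k) is a constant shift per row, so it cancels
# inside the softmax and standard attention is recovered.
uniform = np.full((n_q, n_k), 1.0 / n_k)
with_uniform = attention_with_prior(scores, uniform)

# Hypothetical non-uniform prior that favours keys near the end of the
# sequence (an assumption for illustration only).
decay = np.exp(-0.5 * np.arange(n_k)[::-1])
recency = np.tile(decay / decay.sum(), (n_q, 1))
with_recency = attention_with_prior(scores, recency)

print(np.allclose(standard, with_uniform))
print(with_recency.sum(axis=-1))
```

In this form a non-uniform prior enters only as an additive `log(prior)` term on the attention logits. How GOAT structures that term so that it remains compatible with FlashAttention is a detail of the paper that this sketch does not attempt to reproduce.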
