LG CLApr 8, 2024

Softmax Attention with Constant Cost per Token

arXiv:2404.05843v24.62 citationsh-index: 3Has Code

Originality Highly original

AI Analysis

This work addresses a scalability bottleneck in large-scale language models, offering a potential alternative to conventional attention for more efficient AI systems.

The authors tackled the quadratic computational cost of Transformer attention by proposing a modification that linearizes attention using exponential kernel feature maps, achieving constant time and space complexity per token while maintaining practical functionality.

We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Our modification linearizes attention with exponential kernel feature maps, whose corresponding feature function is infinite dimensional. We show that our modification is expressible as a composition of log-sums of exponentials, with a latent space of constant size, enabling application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.

View on arXiv PDF Code

Similar