LG · Sep 27, 2024

Cottention: Linear Transformers With Cosine Attention

Gabriel Mongaras, Trevor Dohm, Eric C. Larson

arXiv:2409.18747v111.56 citations · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the memory inefficiency of processing longer sequences in transformer models. It offers a promising alternative to softmax attention with potentially broad impact in NLP and AI, though it is an incremental improvement over existing attention mechanisms.

The paper tackles the quadratic memory complexity of softmax attention in transformers by introducing Cottention, an attention mechanism that uses cosine similarity in place of softmax. Cottention achieves native linear memory complexity with respect to sequence length and performs comparably to softmax attention on BERT and GPT tasks while significantly reducing memory requirements.

Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.
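The rearrangement described in the abstract can be illustrated with a minimal pure-Python sketch. It assumes that cosine attention means L2-normalizing each query and key row before taking dot products, and it leaves out any output scaling, stabilization, or multi-head handling the paper may apply. All function names here are hypothetical and are not taken from the authors' code or CUDA kernel. Because no softmax sits between the two matrix products, associativity lets (Q̂K̂ᵀ)V be computed as Q̂(K̂ᵀV), which replaces the n × n intermediate with a d × d_v one, and the causal case can be accumulated one token at a time as a fixed-size state.

```python
import math
import random


def l2_normalize(rows, eps=1e-12):
    """Scale each row vector to (approximately) unit Euclidean length."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) + eps
        out.append([x / norm for x in row])
    return out


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def cosine_attention_quadratic(q, k, v):
    """(Q_hat K_hat^T) V: materializes an n x n similarity matrix."""
    qn, kn = l2_normalize(q), l2_normalize(k)
    scores = matmul(qn, transpose(kn))  # n x n
    return matmul(scores, v)


def cosine_attention_linear(q, k, v):
    """Q_hat (K_hat^T V): same result with only a d x d_v intermediate."""
    qn, kn = l2_normalize(q), l2_normalize(k)
    kv = matmul(transpose(kn), v)  # d x d_v, independent of sequence length
    return matmul(qn, kv)


def cosine_attention_causal_masked(q, k, v):
    """Causal reference: quadratic form with a lower-triangular mask."""
    qn, kn = l2_normalize(q), l2_normalize(k)
    scores = matmul(qn, transpose(kn))
    masked = [[s if j <= i else 0.0 for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    return matmul(masked, v)


def cosine_attention_causal_recurrent(q, k, v):
    """Causal form as an RNN: S_t = S_{t-1} + k_t v_t^T, o_t = q_t^T S_t."""
    qn, kn = l2_normalize(q), l2_normalize(k)
    d, d_v = len(q[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d)]  # fixed-size hidden state
    outputs = []
    for q_t, k_t, v_t in zip(qn, kn, v):
        for i in range(d):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d))
                        for j in range(d_v)])
    return outputs


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Small random example: n tokens, head dimension d.
random.seed(0)
n, d = 6, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
```

Comparing `cosine_attention_quadratic(q, k, v)` with `cosine_attention_linear(q, k, v)`, and the masked version with the recurrent one, should give agreement up to floating-point rounding. The state in the recurrent version has d × d_v entries regardless of n, which is the sense in which inference memory stays constant in this sketch.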
