CL · ML · Feb 27, 2024

Latte: Latent Attention for Linear Time Transformers

arXiv:2402.17512v4 · 2 citations · h-index: 4 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses the poor scalability of transformers on long sequences, allowing the context length of pre-trained models to be extended with minimal additional training. The advance is incremental, however, since it builds on existing attention mechanisms.

The paper tackles the quadratic time complexity of standard transformer attention by proposing a probabilistic framework that enables a low-rank linear reparameterization, achieving performance comparable to state-of-the-art models with linear time and memory complexity and constant-time next-token prediction.

The time complexity of the standard attention mechanism in transformers scales quadratically with sequence length. We propose a probabilistic framework for attention, enabling us to derive a novel low-rank linear re-parameterisation of both bidirectional and causal cases, based on defining a latent variable model. Our method can be seamlessly integrated as a drop-in replacement for the standard attention mechanism. Additionally, this framework provides a natural extension for combining local standard attention with our global linear attention. This approach allows us to extend the context length of existing large pre-trained models with only a few additional training steps. The resulting "Latte Transformer" achieves performance comparable to standard attention and other state-of-the-art models, while maintaining linear time and memory complexity, along with constant-time next-token prediction during inference.
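The abstract does not spell out the parameterisation, so the following is only a minimal sketch of how causal attention routed through a small set of latent states could give constant-time next-token updates. It assumes the attention weights factorise as p(s | t) = sum_l p(l | t) p(s | l, t), with p(l | t) a softmax over latents of per-token query scores and p(s | l, t) proportional to exp(k[s][l]) over positions s <= t. The class name `CausalLatentAttention`, the helper `quadratic_reference`, and the toy scores `q`, `k`, `v` are all hypothetical and chosen for illustration; they are not taken from the paper's code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class CausalLatentAttention:
    """Streaming causal attention through L latent states (illustrative only).

    Assumed factorisation: p(s | t) = sum_l p(l | t) * p(s | l, t), where
    p(l | t) is a softmax over latents of q[t] and p(s | l, t) is
    proportional to exp(k[s][l]) for positions s <= t.
    """

    def __init__(self, num_latents, value_dim):
        # Running sum over seen positions of exp(k[s][l]), one per latent.
        self.norm = [0.0] * num_latents
        # Running sum over seen positions of exp(k[s][l]) * v[s], one vector per latent.
        self.acc = [[0.0] * value_dim for _ in range(num_latents)]

    def step(self, q_t, k_t, v_t):
        """Consume one token in O(L * D) time, independent of sequence length."""
        for l, score in enumerate(k_t):
            # Unstabilised exp: a real implementation would track a running max.
            w = math.exp(score)
            self.norm[l] += w
            for d, value in enumerate(v_t):
                self.acc[l][d] += w * value
        p_latent = softmax(q_t)
        out = [0.0] * len(v_t)
        for l, p in enumerate(p_latent):
            for d in range(len(v_t)):
                out[d] += p * self.acc[l][d] / self.norm[l]
        return out


def quadratic_reference(q, k, v):
    """Same quantity computed by materialising every p(s | t), for checking."""
    outputs = []
    for t in range(len(q)):
        p_latent = softmax(q[t])
        out = [0.0] * len(v[0])
        for l, p in enumerate(p_latent):
            p_pos = softmax([k[s][l] for s in range(t + 1)])
            for s, w in enumerate(p_pos):
                for d in range(len(v[0])):
                    out[d] += p * w * v[s][d]
        outputs.append(out)
    return outputs


# Toy inputs: 3 tokens, 2 latent states, 2-dimensional values.
q = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.9]]
k = [[0.3, 0.0], [-0.2, 0.6], [0.1, -0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

layer = CausalLatentAttention(num_latents=2, value_dim=2)
streamed = [layer.step(q[t], k[t], v[t]) for t in range(len(q))]
```

Under this assumed factorisation, the per-token state is just one normaliser and one accumulated value vector per latent, so memory does not grow with the number of tokens seen. The quadratic helper exists only to confirm that the streamed outputs match the explicitly normalised attention weights.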

