Transformer-VQ: Linear-Time Transformers via Vector Quantization
Transformer-VQ addresses the scalability bottleneck of long-sequence processing in transformers, enabling faster and more efficient models for applications such as language and image generation, although it is best viewed as an incremental improvement over existing attention mechanisms.
The paper tackles the quadratic-time cost of self-attention in transformers by introducing Transformer-VQ, which achieves linear-time attention via vector-quantized keys and a novel caching mechanism. The model reaches competitive quality (e.g., 0.99 bpb on Enwik8) and runs over 3x faster than a comparable quadratic-time transformer at sequence length 8k.
We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: https://github.com/transformer-vq/transformer_vq
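
To illustrate why vector-quantized keys can make softmax attention cheaper, the following is a minimal sketch, not the authors' implementation. It assumes a simplified non-causal setting with a fixed random codebook and omits the caching mechanism entirely. The function names (quantize, dense_attention, codebook_attention) and the sizes T, S, and D are hypothetical choices made for this example. The idea shown is that once every key is replaced by one of S codebook rows, each query only needs S attention scores, with values and key counts aggregated per code in a single pass.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize(key, codebook):
    """Index of the codebook row nearest to `key` (squared L2 distance)."""
    dists = [sum((k - c) ** 2 for k, c in zip(key, code)) for code in codebook]
    return dists.index(min(dists))


def dense_attention(queries, keys, values, codebook):
    """Quadratic reference: softmax attention over the quantized keys."""
    quantized = [codebook[quantize(k, codebook)] for k in keys]
    dim_v = len(values[0])
    out = []
    for q in queries:
        # One score per key: T scores per query.
        # (No max-subtraction here; fine for the small values in this demo.)
        weights = [math.exp(dot(q, k)) for k in quantized]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(dim_v)])
    return out


def codebook_attention(queries, keys, values, codebook):
    """Same output, but each query scores only the S codebook rows."""
    dim_v = len(values[0])
    counts = [0] * len(codebook)                    # keys assigned to each code
    value_sums = [[0.0] * dim_v for _ in codebook]  # summed values per code
    for k, v in zip(keys, values):                  # single pass over the keys
        s = quantize(k, codebook)
        counts[s] += 1
        for d in range(dim_v):
            value_sums[s][d] += v[d]
    out = []
    for q in queries:
        code_w = [math.exp(dot(q, c)) for c in codebook]  # S scores per query
        total = sum(w * n for w, n in zip(code_w, counts))
        out.append([sum(w * vs[d] for w, vs in zip(code_w, value_sums)) / total
                    for d in range(dim_v)])
    return out


def rand_vec(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


random.seed(0)
T, S, D = 64, 4, 8  # sequence length, codebook size, head dimension (assumed)
queries = [rand_vec(D) for _ in range(T)]
keys = [rand_vec(D) for _ in range(T)]
values = [rand_vec(D) for _ in range(T)]
codebook = [rand_vec(D) for _ in range(S)]

ref = dense_attention(queries, keys, values, codebook)
fast = codebook_attention(queries, keys, values, codebook)
max_diff = max(abs(a - b) for ra, rb in zip(ref, fast) for a, b in zip(ra, rb))
print("max abs difference:", max_diff)
```

In this sketch the per-query cost depends on the codebook size S rather than on the sequence length T, so total work grows linearly in T for a fixed S. Handling causal masking efficiently is what the paper's caching mechanism is for, and it is not reproduced here.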