LG | CL | CV | Sep 28, 2023

Transformer-VQ: Linear-Time Transformers via Vector Quantization

arXiv:2309.16354v2 | 30 citations | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

Transformer-VQ addresses the quadratic cost of self-attention, the main scalability bottleneck for long-sequence processing in transformers. It enables faster and more efficient models for applications such as language and image generation, though it is an incremental improvement over existing attention mechanisms.

The paper tackles the quadratic time complexity of self-attention in transformers by introducing Transformer-VQ, which achieves linear-time attention via vector-quantized keys and a novel caching mechanism. It reports competitive quality (e.g., 0.99 bpb on Enwik8) and an optimized implementation that is over 3x faster than a comparable quadratic-time transformer at sequence length 8k.

We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: https://github.com/transformer-vq/transformer_vq
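
The abstract attributes the linear-time attention to vector-quantized keys plus a cache. The sketch below is one way to see why that combination can work. It is an illustration under stated assumptions, not the authors' implementation. Once every key is replaced by its nearest codeword, all keys that share a codeword receive the same attention score, so a running count and a running value sum per codeword are enough to reproduce causal softmax attention exactly. The function names (`nearest_code`, `quadratic_attention`, `cached_attention`) and the plain-list data layout are hypothetical, and the sketch leaves out any other components the real model may use (for example score scaling, batching, or numerical stabilization of the exponentials).

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def nearest_code(key, codebook):
    """Index of the codeword closest to `key` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda s: sum((k - c) ** 2 for k, c in zip(key, codebook[s])),
    )


def quadratic_attention(queries, keys, values, codebook):
    """Reference: causal softmax attention over quantized keys, O(T^2) time."""
    codes = [nearest_code(k, codebook) for k in keys]
    dim = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        # One unnormalized weight per past position j <= t.
        weights = [math.exp(dot(q, codebook[codes[j]])) for j in range(t + 1)]
        total = sum(weights)
        outputs.append(
            [sum(w * values[j][d] for j, w in enumerate(weights)) / total
             for d in range(dim)]
        )
    return outputs


def cached_attention(queries, keys, values, codebook):
    """Same outputs from per-codeword running sums, O(T * S) time.

    Assumption of this sketch: the cache holds, for each codeword, how many
    past keys mapped to it and the sum of their values.
    """
    num_codes, dim = len(codebook), len(values[0])
    counts = [0] * num_codes
    value_sums = [[0.0] * dim for _ in range(num_codes)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Fold the current position into the cache before attending to it.
        s = nearest_code(k, codebook)
        counts[s] += 1
        for d in range(dim):
            value_sums[s][d] += v[d]
        # One score per codeword instead of one per past position.
        scores = [math.exp(dot(q, c)) for c in codebook]
        total = sum(sc * n for sc, n in zip(scores, counts))
        outputs.append(
            [sum(scores[i] * value_sums[i][d] for i in range(num_codes)) / total
             for d in range(dim)]
        )
    return outputs
```

Calling both functions on the same queries, keys, values, and codebook should give matching outputs up to floating-point rounding. The cached version does work proportional to the codebook size at each step rather than to the number of past positions, which is where the linear scaling in sequence length comes from in this simplified setting.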

Code Implementations: 1 repo