LGCVDSFeb 5, 2023

KDEformer: Accelerating Transformers via Kernel Density Estimation

arXiv:2302.02451v256 citationsh-index: 40
AI Analysis

This addresses the efficiency problem for training long-sequence models in deep learning, offering a sub-quadratic method with provable bounds, though it is an incremental improvement over prior approximations.

The paper tackles the quadratic computational bottleneck of dot-product attention in Transformers by approximating attention via kernel density estimation, achieving over 4x speedup in BigGAN image generation with better scores and over 18x speedup in ImageNet classification with less than 0.5% accuracy drop.

Dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling, however, naïve exact computation of this model incurs quadratic time and memory complexities in sequence length, hindering the training of long-sequence models. Critical bottlenecks are due to the computation of partition functions in the denominator of softmax function as well as the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over $4\times$ speedup. For ImageNet classification with T2T-ViT, KDEformer shows over $18\times$ speedup while the accuracy drop is less than $0.5\%$.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes