LG · ML · Feb 1, 2023

FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features

Cambridge
arXiv:2302.00787v1 · 4 citations · h-index: 49
Originality: Highly original
AI Analysis

This work addresses the efficiency bottleneck of kernel approximation and self-attention in models such as Transformers. It offers a new way to optimize random features that substantially reduces variance, though it builds incrementally on existing random feature methods.

The paper tackles the problem of approximating linear operators induced by the Gaussian and softmax kernels, which are central to kernel methods and efficient Transformers. It introduces parameterized, positive, non-trigonometric random features whose parameters can be optimized to reduce the variance of the approximation. The reported variance is $e^{10}$-times smaller and beyond, and the methods outperform previous ones in kernel regression, speech modelling, and natural language processing.

The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
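To make the setting concrete, here is a minimal NumPy sketch of the baseline that the abstract builds on: the standard, non-parameterized positive random features for the softmax kernel $\exp(x^\top y)$, as used in earlier Performer-style attention. This is not the paper's parameterized FAVOR# construction, and the function names, dimensions, and scaling below are illustrative assumptions. It relies on the identity $\exp(x^\top y) = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\left[\exp(w^\top x - \|x\|^2/2)\,\exp(w^\top y - \|y\|^2/2)\right]$, and shows how the linear operator $KV$ can be applied without ever forming the kernel matrix $K$.

```python
import numpy as np


def positive_random_features(X, W):
    """Map rows of X (n, d) to strictly positive features (n, m).

    W holds m Gaussian projection vectors of dimension d. This is the
    standard positive feature map, NOT the optimized FAVOR# variant.
    """
    m = W.shape[0]
    half_sq_norm = 0.5 * np.sum(X ** 2, axis=1, keepdims=True)  # (n, 1)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(m)


def approx_kernel_operator(X, Y, V, W):
    """Approximate K @ V, where K[i, j] = exp(x_i . y_j), without forming K."""
    phi_x = positive_random_features(X, W)  # (n, m)
    phi_y = positive_random_features(Y, W)  # (L, m)
    # Multiply right-to-left so the cost is linear in the number of rows.
    return phi_x @ (phi_y.T @ V)


# Illustrative sizes and scaling (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
d, m, n = 8, 20000, 5
X = 0.25 * rng.standard_normal((n, d))
Y = 0.25 * rng.standard_normal((n, d))
V = rng.standard_normal((n, 3))
W = rng.standard_normal((m, d))  # rows w ~ N(0, I_d)

K_exact = np.exp(X @ Y.T)
K_approx = positive_random_features(X, W) @ positive_random_features(Y, W).T
max_rel_err = np.max(np.abs(K_approx - K_exact) / K_exact)

KV_approx = approx_kernel_operator(X, Y, V, W)
print("max relative error of kernel estimate:", max_rel_err)
```

The estimate is unbiased, and every feature is positive, so approximate attention weights never turn negative. Its relative variance grows like $\exp(\|x + y\|^2)/m$, which is why the inputs above are kept small. Per the abstract, reducing this variance through optimized feature parameters is the contribution of the paper.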

