Chefs' Random Tables: Non-Trigonometric Random Features
This work addresses the need for efficient kernel approximation in machine learning, particularly for low-rank Transformers, though the contribution appears incremental because it builds on existing random feature methods.
The paper tackles the problem of approximating Gaussian and softmax kernels with random features, introducing chefs' random tables (CRTs) as a non-trigonometric alternative to standard methods, and achieves new state-of-the-art results for low-rank text Transformers with linear space and time complexity.
We introduce chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics that are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is, to our knowledge, the first RF method for unbiased softmax kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied in OPRFs provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity.
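To make the contrast between trigonometric and positive random features concrete, the following is a minimal NumPy sketch of two well-known baseline estimators of the softmax kernel $\exp(x^\top y)$: an RKS-style trigonometric map, which uses $\mathbb{E}[\cos(w^\top (x - y))] = \exp(-\|x - y\|^2 / 2)$, and a positive map, which uses $\mathbb{E}[\exp(w^\top (x + y))] = \exp(\|x + y\|^2 / 2)$, both for $w \sim \mathcal{N}(0, I_d)$. It is not the CRT or OPRF construction, whose details are not given in this abstract. All function names, the feature count `m`, the input scaling, and the block-orthogonal sampler (QR decomposition with a sign fix, rows rescaled to Gaussian-vector norms) are illustrative assumptions.

```python
import numpy as np


def gaussian_matrix(m, d, rng):
    """Unstructured projections: m i.i.d. rows drawn from N(0, I_d)."""
    return rng.standard_normal((m, d))


def orthogonal_gaussian_matrix(m, d, rng):
    """Block-orthogonal projections whose rows are marginally N(0, I_d)."""
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) square blocks
        q, r = np.linalg.qr(rng.standard_normal((d, d)))
        q = q * np.sign(np.diag(r))  # sign fix so q is uniformly (Haar) distributed
        blocks.append(q.T)
    directions = np.concatenate(blocks, axis=0)[:m]
    # Rescale each unit-norm row to the norm of an independent Gaussian vector.
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1, keepdims=True)
    return norms * directions


def trig_features(x, W):
    """Trigonometric (RKS-style) features; inner products estimate exp(x . y)."""
    m = W.shape[0]
    proj = x @ W.T
    scale = np.exp(0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)
    return scale * np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)


def positive_features(x, W):
    """Positive features; every entry is strictly greater than zero."""
    m = W.shape[0]
    proj = x @ W.T
    half_sq_norm = 0.5 * np.sum(x**2, axis=-1, keepdims=True)
    return np.exp(proj - half_sq_norm) / np.sqrt(m)


def softmax_kernel_estimate(feature_map, x, y, W):
    """Estimate the matrix of exp(x_i . y_j) from random features."""
    return feature_map(x, W) @ feature_map(y, W).T


# Hypothetical toy data: 4 queries and 4 keys of norm 0.5 in d = 16 dimensions.
rng = np.random.default_rng(0)
d, m = 16, 4096
x = rng.standard_normal((4, d))
y = rng.standard_normal((4, d))
x *= 0.5 / np.linalg.norm(x, axis=1, keepdims=True)
y *= 0.5 / np.linalg.norm(y, axis=1, keepdims=True)
exact = np.exp(x @ y.T)

W = orthogonal_gaussian_matrix(m, d, rng)
for name, fmap in [("trigonometric", trig_features), ("positive", positive_features)]:
    err = np.max(np.abs(softmax_kernel_estimate(fmap, x, y, W) - exact))
    print(f"{name}: max abs error = {err:.4f}")
```

Both maps give unbiased estimates, but only the positive map guarantees nonnegative kernel estimates, which is why positivity matters when the estimates serve as attention weights in a normalized sum. Swapping `orthogonal_gaussian_matrix` for `gaussian_matrix` gives the unstructured baseline for comparison.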