Spectraformer: A Unified Random Feature Framework for Transformer
This work addresses the computational bottleneck of Transformers on long sequences and offers a systematic approach to improving efficiency. The contribution is incremental, since it builds on existing random feature methods.
The authors tackle the problem of efficiently approximating the attention mechanism in Transformers by introducing Spectraformer, a unified random feature framework. It achieves performance comparable to top sparse and low-rank methods on the Long Range Arena benchmark, establishing a new state-of-the-art for random feature-based efficient Transformers.
Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used only a subset of the possible combinations of component functions and weight matrices within the random feature paradigm. We identify the need for a systematic comparison of different combinations of weight matrices and component functions for attention learning in the Transformer. Hence, we introduce Spectraformer, a unified framework for approximating and learning the kernel function in the attention mechanism of the Transformer. Our empirical results demonstrate, for the first time, that a random feature-based approach can achieve performance comparable to top-performing sparse and low-rank methods on the challenging Long Range Arena benchmark. Thus, we establish a new state-of-the-art for random feature-based efficient Transformers. The framework also produces many variants that offer different advantages in accuracy, training time, and memory consumption. Our code is available at https://github.com/cruiseresearchgroup/spectraformer.
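
To make the random feature paradigm concrete, the following is a minimal NumPy sketch of linearized attention built from one weight matrix and one component function. It is not taken from the Spectraformer code base. The pairing chosen here (i.i.d. Gaussian weights with a positive exponential component function, in the style of earlier random feature attention methods) is an assumption picked for illustration, and the names gaussian_weights, positive_features, and linear_attention are hypothetical.

```python
import numpy as np


def gaussian_weights(m, d, rng):
    """Weight matrix (assumed choice): m i.i.d. standard Gaussian rows."""
    return rng.standard_normal((m, d))


def positive_features(X, W):
    """Component function (assumed choice): exp(Wx - |x|^2 / 2) / sqrt(m).

    For Gaussian W, the inner product of two feature vectors is an
    unbiased estimate of exp(x . y).
    """
    proj = X @ W.T                                    # (n, m)
    half_sq_norm = 0.5 * np.sum(X ** 2, axis=1, keepdims=True)
    return np.exp(proj - half_sq_norm) / np.sqrt(W.shape[0])


def linear_attention(Q, K, V, W):
    """Approximate softmax attention without forming the n x n matrix."""
    scale = Q.shape[1] ** -0.25                       # splits 1/sqrt(d) over Q and K
    phi_q = positive_features(Q * scale, W)           # (n, m)
    phi_k = positive_features(K * scale, W)           # (n, m)
    kv = phi_k.T @ V                                  # (m, d_v), linear in n
    normalizer = phi_q @ phi_k.sum(axis=0)            # (n,)
    return (phi_q @ kv) / normalizer[:, None]


def softmax_attention(Q, K, V):
    """Exact quadratic-cost attention, used only as a reference."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(0)
n, d, m = 16, 8, 4096
Q = 0.5 * rng.standard_normal((n, d))
K = 0.5 * rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
W = gaussian_weights(m, d, rng)

approx = linear_attention(Q, K, V, W)
exact = softmax_attention(Q, K, V)
print("max abs error:", np.abs(approx - exact).max())
```

In this sketch, trying a different combination amounts to replacing gaussian_weights or positive_features while leaving linear_attention unchanged, which is the kind of systematic comparison the abstract describes.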