Fast Gradient Computation for RoPE Attention in Almost Linear Time
This work addresses a computational bottleneck for researchers and practitioners using RoPE-enhanced Transformers, though it is incremental as it builds on prior forward computation methods.
The paper tackles the problem of inefficient backward gradient computations in RoPE-based attention mechanisms, developing the first almost linear time algorithm for these computations under bounded entries, with a time complexity of n^(1+o(1)).
The Rotary Position Embedding (RoPE) mechanism has become a powerful enhancement to the Transformer architecture, which enables models to capture token relationships when encoding positional information. However, the RoPE mechanisms make the computations of attention mechanisms more complicated, which makes efficient algorithms challenging. Earlier research introduced almost linear time, i.e., $n^{1+o(1)}$ where $n$ is the number of input tokens, algorithms for the forward computation under specific parameter settings. However, achieving a subquadratic time algorithm for other parameter regimes remains impossible unless the widely accepted Strong Exponential Time Hypothesis (SETH) is disproven. In this work, we develop the first almost linear time algorithm for backward computations in the RoPE-based attention under bounded entries. Our approach builds on recent advancements in fast RoPE attention computations, utilizing a novel combination of the polynomial method and the Fast Fourier Transform. Furthermore, we show that with lower bounds derived from the SETH, the bounded entry condition is necessary for subquadratic performance.