Beyond Classical Attention: Quantum Attention for Scalable Computation
This work addresses the core challenge of accelerating LLMs for researchers and practitioners, but it is incremental as it applies an existing quantum algorithm to a specific bottleneck.
The paper tackles the computational overhead of attention matrices in large language models by proposing a quantum attention method using the Grover search algorithm, achieving polynomial-time quantum acceleration and demonstrating low-rank structures in the generated matrices.
As large language models (LLMs) demonstrate outstanding performance across various tasks, attention-driven models have profoundly transformed the field of machine learning. Since attention computations account for the primary computational overhead in both model inference and training, efficiently computing attention matrices has become one of the core challenges in accelerating large language models. It is well-known that quantum machines possess computational advantages over classical machines, and the role of quantum computing in LLMs remains largely unexplored. In this work, we focus on leveraging the Grover search algorithm to efficiently compute a sparse attention matrix. Through comparisons with classical algorithms, we demonstrate that our method achieves quantum acceleration in polynomial time. Additionally, we observe that the generated quantum attention matrices naturally exhibit low-rank structures, providing further theoretical support for efficient modeling. Moreover, within the specific context of attention matrix computation, we conduct a systematic and detailed analysis of the error and time complexity of the proposed algorithm.