Quantum Transformer: Accelerating model inference via quantum linear algebra
This work addresses the high computational cost of inference for users of large language models, but its contribution is incremental: it adapts existing quantum linear algebra methods to the transformer architecture.
The authors tackle the computational cost of transformer inference in large language models by developing quantum subroutines for its key components, and they use numerical experiments on open-source LLMs to show the potential for a quantum speedup in practical regimes.
Powerful generative artificial intelligence from large language models (LLMs) harnesses extensive computational resources for inference. In this work, we investigate the transformer architecture, a key component of these models, through the lens of fault-tolerant quantum computing. We develop quantum subroutines to construct the building blocks of the transformer, including self-attention, the residual connection with layer normalization, and the feed-forward network. As an important subroutine, we show how to efficiently implement the Hadamard product and element-wise functions of matrices on quantum computers. Our algorithm prepares an amplitude encoding of the transformer output, which can be measured for prediction or used in the next layer. We find that the matrix norm of the input sequence plays a dominant role in the quantum complexity. With numerical experiments on open-source LLMs, including for bioinformatics applications, we demonstrate the potential of a quantum speedup for transformer inference in practical regimes.
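
For orientation, the building blocks named in the abstract can be written down classically. The following is a minimal NumPy sketch of a single-head transformer block (self-attention, residual connection with layer normalization, feed-forward network), followed by the normalized vector that an amplitude encoding of the output would represent. It is a classical reference computation, not the quantum algorithm of the paper. All function names, dimensions, and random weights are hypothetical; the ReLU activation and the layer normalization without learned scale and bias are simplifying assumptions. Because the text does not say which matrix norm of the input governs the complexity, the sketch reports both the spectral and the Frobenius norm.

```python
import numpy as np


def softmax_rows(x):
    # Row-wise softmax; subtracting the row maximum improves numerical stability.
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Layer normalization over the feature axis, without learned scale or bias
    # (a simplifying assumption for this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def transformer_block(S, Wq, Wk, Wv, W1, W2):
    # Single-head self-attention on the input sequence S (tokens x features).
    d = Wq.shape[1]
    Q, K, V = S @ Wq, S @ Wk, S @ Wv
    attention = softmax_rows(Q @ K.T / np.sqrt(d)) @ V
    # Residual connection followed by layer normalization.
    h = layer_norm(S + attention)
    # Feed-forward network; ReLU stands in for the element-wise activation
    # (an assumption, the source does not name the activation).
    ffn = np.maximum(h @ W1, 0.0) @ W2
    return layer_norm(h + ffn)


def amplitude_encode(v):
    # Classical stand-in for an amplitude encoding: the unit vector whose
    # entries would be the amplitudes of the quantum state.
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return v / norm


# Hypothetical toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden = 4, 8, 16
S = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_hidden, d_model))

out = transformer_block(S, Wq, Wk, Wv, W1, W2)

# Amplitudes for the last token's output; measuring such a state would
# sample index i with probability state[i] ** 2.
state = amplitude_encode(out[-1])
probabilities = state ** 2

# Two common matrix norms of the input sequence.
spectral_norm = np.linalg.norm(S, 2)
frobenius_norm = np.linalg.norm(S, "fro")
print(out.shape, spectral_norm, frobenius_norm)
```

The Hadamard product mentioned in the abstract corresponds to NumPy's element-wise `A * B`, and element-wise functions correspond to calls such as `np.exp` or `np.maximum` above. These operations are cheap classically, but they are not linear maps on the amplitude vector, which is consistent with the abstract singling them out as a subroutine in their own right.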