LG CL MLNov 24, 2023

One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang

arXiv:2311.14652v28.84 citationsh-index: 8

Originality Incremental advance

AI Analysis

This addresses the memory bottleneck for deploying LLMs in streaming applications with extremely long contexts (n >> 128K), though it appears incremental as it builds on polynomial approximation methods.

The paper tackles the memory inefficiency of attention computation in large language models for streaming applications with super-long contexts, introducing a one-pass streaming algorithm that uses sublinear space (o(n)) to store sketch matrices while maintaining diminishing error guarantees as token length increases.

Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we focus on the memory-efficient issue when context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite attention ${\sf Attn}(Q, K, V)$ computation within $n^{1+o(1)}$ time executions. Despite this, computing the approximated attention matrix $U_1U_2^\top \in \mathbb{R}^{n \times n}$ still necessitates $O(n^2)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in a streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need for exact $K, V$ storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.

View on arXiv PDF

Similar