CL · Feb 10, 2025

Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs

arXiv:2502.06766v2 · 3 citations · h-index: 56
Originality: Highly original
AI Analysis

The method enables long-context transformer inference on commodity hardware, removing a bottleneck for applications that must process very long inputs.

The paper tackles the high computational cost of transformer inference on long contexts by proposing a tunable top-k selection mechanism that restricts attention to the most relevant tokens at each generation step. This enables inference on contexts of up to 1M tokens with approximately 16GB of GPU RAM while retaining over 95% of model performance on benchmarks.

There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of transformers to long contexts on commodity (i.e., not data-center-scale) hardware. To address the inference-time costs of running self-attention-based transformer language models on long contexts and to enable their adoption on widely available hardware, we propose a tunable mechanism that reduces the cost of the forward pass by attending to only the most relevant tokens at every generation step using top-k selection. We showcase the efficiency gains afforded by our method by performing inference on context windows of up to 1M tokens using approximately 16GB of GPU RAM. Our experiments reveal that models are capable of handling the sparsity induced by the reduced number of keys and values. By attending to fewer than 2% of input tokens, we achieve over 95% of model performance on common benchmarks (RULER, AlpacaEval, and Open LLM Leaderboard).
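The core idea in the abstract, attending only to the k highest-scoring tokens at a generation step, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `topk_attention`, the exact (non-approximate) selection via `heapq.nlargest`, and the single-head, single-query setup are all assumptions made for illustration. The abstract does not say how the top-k keys are retrieved or where the key-value cache is stored.

```python
import heapq
import math


def topk_attention(query, keys, values, k):
    """Single-query attention restricted to the k highest-scoring keys.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    d = len(query)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    # Indices of the k largest scores; every other token is ignored.
    k = min(k, len(keys))
    top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
    # Softmax over the selected scores only (shifted by the max for stability).
    m = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - m) for i in top]
    total = sum(weights)
    # Weighted sum of the selected value vectors.
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out, sorted(top)


# Toy example: four cached tokens, keep the two most relevant to the query.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, kept = topk_attention([1.0, 0.0], keys, values, k=2)
print(kept)  # indices of the tokens that were attended to
```

Setting k equal to the number of cached tokens recovers ordinary dense attention, which is what makes k a tunable trade-off between cost and fidelity. For scale, attending to 2% of a 1M-token context means roughly 20,000 keys and values contribute to each step.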

Code Implementations: 1 repo
