AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
This work addresses efficiency bottlenecks in long-context generative LLM inference for AI applications. It is an incremental improvement over existing sparse attention methods.
The paper tackles quadratic attention complexity and KV cache memory pressure in long-context LLM inference. It proposes AsyncTLS, a hierarchical sparse attention system that combines block filtering and token selection with asynchronous offloading. AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x-10.0x operator speedups and 1.3x-4.7x end-to-end throughput improvements on 48k-96k contexts.
Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation by exploiting temporal locality. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x-10.0x operator speedups and 1.3x-4.7x end-to-end throughput improvements on 48k-96k contexts.
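
The two-level idea (coarse block filtering followed by fine token selection) can be illustrated with a minimal single-query sketch. This is not the AsyncTLS implementation: the function name, the mean-key block score, and the parameters `block_size`, `top_blocks`, and `top_tokens` are assumptions chosen for illustration, and the abstract does not specify how blocks are actually scored.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_level_sparse_attention(q, keys, values,
                               block_size=4, top_blocks=2, top_tokens=4):
    """Illustrative two-level sparse attention for one query vector.

    Level 1 keeps the `top_blocks` blocks whose mean key best matches q
    (the mean-key score is an assumed, hypothetical block summary).
    Level 2 keeps the `top_tokens` highest-scoring tokens in those blocks.
    Returns (output_vector, sorted_selected_token_indices).
    """
    n, dim = len(keys), len(q)

    # Level 1: coarse block filtering on a cheap per-block summary.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def block_score(blk):
        mean_key = [sum(keys[i][d] for i in blk) / len(blk) for d in range(dim)]
        return dot(q, mean_key)

    kept_blocks = sorted(blocks, key=block_score, reverse=True)[:top_blocks]

    # Level 2: exact scaled dot-product scores, only inside surviving blocks.
    candidates = [i for blk in kept_blocks for i in blk]
    scores = {i: dot(q, keys[i]) / math.sqrt(dim) for i in candidates}
    selected = sorted(candidates, key=scores.get, reverse=True)[:top_tokens]

    # Numerically stable softmax over the selected tokens only.
    peak = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - peak) for i in selected]
    total = sum(weights)
    out = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(len(values[0]))
    ]
    return out, sorted(selected)
```

The point of the sketch is the cost structure: the block pass touches one summary per block, and exact per-token scoring is paid only for tokens in the surviving blocks. With `top_blocks` and `top_tokens` set large enough to cover every token, the function reduces to ordinary full attention for that query.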