CL · AI · Feb 21, 2025

Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

arXiv:2502.15294v3 · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses efficiency issues in LLM serving systems for real-world dialogue applications. It is an incremental improvement in inference optimization.

The paper tackles the high GPU memory usage caused by the KV cache of large language models during multi-round conversations. It proposes Round Attention, a mechanism that processes only the KV cache of the top-k most relevant rounds. The authors' theoretical analysis puts the memory reduction at 54% to 82%, and their experiments report that answer accuracy is maintained.

The increasing context window size of large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as a conversation accumulates rounds, a large amount of KV cache must be stored in GPU memory, which significantly affects the efficiency and even the availability of model serving systems. This paper analyzes dialogue data from real users at the granularity of rounds and discovers that LLM inference manifests a watershed layer, after which the distribution of round-level attention shows notable similarity. Based on this, we propose Round Attention, a novel round-level attention mechanism that selectively processes the KV cache of the top-k relevant rounds, where k is dynamically determined through the attention matrix in the watershed layer. Theoretical analysis demonstrates that our method reduces memory usage by 54% to 82%, while experimental results confirm that loading the sparse critical-round KV cache maintains answer accuracy without performance degradation.
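The round-selection step described in the abstract can be illustrated with a small sketch. The abstract does not say how k is derived from the watershed-layer attention matrix, so the rule used here (keep the fewest top-scoring rounds whose summed attention reaches a coverage threshold) is an assumption made for illustration, not the authors' method. The names round_attention_mass, select_rounds, and coverage are hypothetical.

```python
from typing import List, Sequence, Tuple


def round_attention_mass(attn_row: Sequence[float],
                         round_spans: Sequence[Tuple[int, int]]) -> List[float]:
    """Sum the attention weights of the tokens belonging to each round.

    attn_row: attention weights of the current query over all cached tokens
              (assumed here to be taken at the watershed layer).
    round_spans: half-open (start, end) token index ranges, one per round.
    """
    return [sum(attn_row[start:end]) for start, end in round_spans]


def select_rounds(mass: Sequence[float], coverage: float = 0.9) -> List[int]:
    """Return indices of the top-k rounds, with k chosen dynamically.

    Assumed rule (not from the paper): k is the smallest number of
    highest-scoring rounds whose attention mass reaches `coverage`
    of the total mass.
    """
    total = sum(mass)
    if total <= 0:
        # No usable signal: fall back to keeping every round.
        return list(range(len(mass)))
    order = sorted(range(len(mass)), key=lambda i: mass[i], reverse=True)
    chosen: List[int] = []
    covered = 0.0
    for i in order:
        chosen.append(i)
        covered += mass[i]
        if covered >= coverage * total:
            break
    return sorted(chosen)  # keep the rounds in conversation order


# Four rounds of two tokens each; rounds 1 and 3 receive most of the attention.
attn_row = [0.05, 0.05, 0.4, 0.3, 0.02, 0.03, 0.1, 0.05]
round_spans = [(0, 2), (2, 4), (4, 6), (6, 8)]
kept = select_rounds(round_attention_mass(attn_row, round_spans), coverage=0.8)
print(kept)  # [1, 3]
```

In this toy case only the KV cache of rounds 1 and 3 would be loaded for the layers after the watershed layer, so half of the cached tokens are skipped. The actual savings depend on how attention is distributed in real dialogues.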

