CL AI · Jul 10, 2025

Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Ting Cai, Zibin Zheng

arXiv:2507.08045v20.001 citations · h-index: 17

AI Analysis 55

This work addresses a critical bottleneck in multi-turn LLM inference and offers significant speed and storage improvements, though it is incremental in that it builds on existing compression techniques.

The paper tackles the problem of inefficient state restoration in multi-turn LLM conversations by introducing Krul, a system that dynamically compresses KV caches based on conversation-specific attention patterns, resulting in a 1.5x-2.68x reduction in time-to-first-token and a 1.33x-2.35x reduction in storage compared to state-of-the-art methods.
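To make the idea of conversation-specific compression concrete, here is a minimal sketch of how layer pairs might be chosen per conversation. It is an illustration under stated assumptions, not Krul's actual selector: the function names, the use of cosine similarity over one attention vector per layer, the restriction to adjacent layers, and the `threshold` value are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length attention vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_shared_layer_pairs(attn_by_layer, threshold=0.9):
    """Pick non-overlapping adjacent layer pairs whose attention is similar.

    attn_by_layer: one attention-weight vector per layer for ONE conversation
    (hypothetical input format). Returns a sorted list of (i, i + 1) pairs.
    """
    # Score every adjacent pair and keep those above the threshold.
    candidates = []
    for i in range(len(attn_by_layer) - 1):
        sim = cosine_similarity(attn_by_layer[i], attn_by_layer[i + 1])
        if sim >= threshold:
            candidates.append((sim, i))

    # Greedily take the most similar pairs; each layer joins at most one pair.
    candidates.sort(reverse=True)
    used, pairs = set(), []
    for _sim, i in candidates:
        if i in used or i + 1 in used:
            continue
        pairs.append((i, i + 1))
        used.update((i, i + 1))
    return sorted(pairs)

# Two toy conversations over a 4-layer model with different attention profiles.
conv_a = [[0.70, 0.20, 0.10], [0.69, 0.21, 0.10], [0.10, 0.20, 0.70], [0.30, 0.30, 0.40]]
conv_b = [[0.70, 0.20, 0.10], [0.10, 0.20, 0.70], [0.11, 0.20, 0.69], [0.30, 0.30, 0.40]]

pairs_a = select_shared_layer_pairs(conv_a)  # layers 0 and 1 are near-identical
pairs_b = select_shared_layer_pairs(conv_b)  # layers 1 and 2 are near-identical
print(pairs_a, pairs_b)
```

In this toy example the two conversations end up sharing different layer pairs, which a single fixed scheme could not provide for both.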

Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks the variability in attention pattern similarity across conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector that preserves critical context for future conversation turns and selects a customized strategy for each conversation; 2) a token-wise heterogeneous attention similarity estimator that reduces the computation and storage overhead of attention similarity estimation during model generation; 3) a bubble-free restoration scheduler that reduces the pipeline bubbles caused by the imbalance between the recomputation and loading streams that compressed KV caches introduce. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
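The "bubble" in a recomputation-loading pipeline can be illustrated with a small arithmetic sketch. It assumes a deliberately simplified cost model in which both streams process tokens at constant rates; the function names and the throughput figures are hypothetical, and the abstract does not describe how Krul's scheduler actually works.

```python
def split_for_balanced_restore(num_tokens, recompute_tok_per_s, load_tok_per_s):
    """Split tokens so the recompute and load streams finish together.

    Assumes (as a simplification) that each stream has a constant
    tokens-per-second rate. Returns (n_recompute, n_load).
    """
    if num_tokens < 0 or recompute_tok_per_s <= 0 or load_tok_per_s <= 0:
        raise ValueError("token count must be >= 0 and rates must be > 0")
    total_rate = recompute_tok_per_s + load_tok_per_s
    n_recompute = round(num_tokens * recompute_tok_per_s / total_rate)
    return n_recompute, num_tokens - n_recompute

def restore_time(n_recompute, n_load, recompute_tok_per_s, load_tok_per_s):
    """Restoration ends when the slower of the two parallel streams ends."""
    return max(n_recompute / recompute_tok_per_s, n_load / load_tok_per_s)

# Hypothetical numbers: 9000 history tokens, loading twice as fast as
# recomputing (for example because the stored KV cache is compressed).
tokens, recompute_rate, load_rate = 9000, 1000.0, 2000.0

naive_time = restore_time(4500, 4500, recompute_rate, load_rate)
balanced = split_for_balanced_restore(tokens, recompute_rate, load_rate)
balanced_time = restore_time(*balanced, recompute_rate, load_rate)
print(naive_time, balanced, balanced_time)
```

Under these assumptions an even split leaves the loading stream idle while recomputation finishes, whereas the rate-proportional split lets both streams end at the same time.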
