CLSep 19, 2025

LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference

Hantao Yang, Hong Xie, Defu Lian, Enhong Chen

arXiv:2509.15515v11 citationsh-index: 16

Originality Incremental advance

AI Analysis

This work addresses cost reduction in LLM inference for users by improving cache selection under heterogeneous queries, representing an incremental advance over prior methods.

The paper tackles the LLM cache bandit problem by addressing query heterogeneity to improve cost-effective LLM inference, achieving a theoretical regret bound of O(sqrt(MNT)) and reducing total cost by approximately 12% in experiments.

This paper revisits the LLM cache bandit problem, with a special focus on addressing the query heterogeneity for cost-effective LLM inference. Previous works often assume uniform query sizes. Heterogeneous query sizes introduce a combinatorial structure for cache selection, making the cache replacement process more computationally and statistically challenging. We treat optimal cache selection as a knapsack problem and employ an accumulation-based strategy to effectively balance computational overhead and cache updates. In theoretical analysis, we prove that the regret of our algorithm achieves an $O(\sqrt{MNT})$ bound, improving the coefficient of $\sqrt{MN}$ compared to the $O(MN\sqrt{T})$ result in Berkeley, where $N$ is the total number of queries and $M$ is the cache size. Additionally, we also provide a problem-dependent bound, which was absent in previous works. The experiment rely on real-world data show that our algorithm reduces the total cost by approximately 12\%.

View on arXiv PDF

Similar