Sheng Qiang

h-index: 3

2 papers

95 citations

2 Papers

16.1 LG Jul 1

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

Sheng Qiang, Ruiwei Chen, Yinpeng Wu et al.

Long-context LLM services now sustain prompts with hundreds of thousands to millions of tokens, making the key-value (KV) cache a first-order serving cost. Because the cache grows linearly with context length, it can exhaust GPU memory, force smaller batches, and reduce serving throughput. Prior KV cache compression techniques typically target only the sequence dimension or only the channel dimension, which leaves limited headroom as context windows scale. Compressing both dimensions promises higher memory reduction, but applying the two forms of compression directly leads to significant accuracy loss. This paper introduces MosaicKV, a dynamic two-D (dimensional) KV cache compression system for extremely long-context serving. MosaicKV uses dynamic two-D compression to address the accuracy challenge, exploiting the non-uniform importance distribution of elements within the KV cache. Instead of applying one compression pattern globally, MosaicKV identifies important elements for each KV vector and selects compression strategies at the granularity of KV cache segments. To address the performance challenge, where fine-grained sparsity and compression management overhead can offset the gains from compression, MosaicKV introduces compressed KV cache management. This mechanism uses underutilized GPU and CPU resources to maintain compressed KV caches and accelerate attention computation. Evaluation on an H800 GPU with multiple LLMs shows that MosaicKV delivers up to 16x attention speedup, 4.8x lower decode latency, and 7.3x higher throughput than the uncompressed baseline. At the same time, it reduces memory usage by 3x and incurs only 1.76% average accuracy loss on LongBench and RULER.
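To make the linear growth mentioned in the abstract concrete, the following back-of-the-envelope sketch computes the size of an uncompressed KV cache using the standard accounting (one key and one value vector per layer, per KV head, per token). The model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration; it is not taken from the paper, and the function name `kv_cache_bytes` is likewise an assumption.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size in bytes of an uncompressed KV cache for one sequence.

    The leading factor 2 counts the key and the value tensors.
    All shape parameters are illustrative assumptions, not paper values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical model shape (assumption, not from the paper).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (128_000, 1_000_000):
    gib = kv_cache_bytes(tokens, **shape) / 2**30
    print(f"{tokens:>9} tokens: {gib:.1f} GiB")
```

Under these assumed dimensions each token costs 128 KiB, so a single million-token sequence would need on the order of 120 GiB for its cache alone, which is why the cache size constrains batch size and throughput.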

20.1MLApr 25, 2016

Dynamic Pricing with Demand Covariates

Sheng Qiang, Mohsen Bayati

We consider a firm that sells products over $T$ periods without knowing the demand function. The firm sequentially sets prices to earn revenue and to learn the underlying demand function simultaneously. A natural heuristic for this problem, commonly used in practice, is greedy iterative least squares (GILS). At each time period, GILS estimates the demand as a linear function of the price by applying least squares to the set of prior prices and realized demands. Then a price that maximizes the revenue, given the estimated demand function, is used for the next time period. The performance is measured by the regret, which is the expected revenue loss from the optimal (oracle) pricing policy when the demand function is known. Recently, den Boer and Zwart (2014) and Keskin and Zeevi (2014) demonstrated that GILS is sub-optimal. They introduced algorithms which integrate forced price dispersion with GILS and achieve asymptotically optimal performance. In this paper, we consider this dynamic pricing problem in a data-rich environment. In particular, we assume that the firm knows the expected demand under a particular price from historical data, and in each period, before setting the price, the firm has access to extra information (demand covariates) which may be predictive of the demand. We prove that in this setting GILS achieves asymptotically optimal regret of order $\log(T)$. We also show the following surprising result: in the original dynamic pricing problem of den Boer and Zwart (2014) and Keskin and Zeevi (2014), inclusion of any set of covariates in GILS as potential demand covariates (even though they could carry no information) would make GILS asymptotically optimal. We validate our results via extensive numerical simulations on synthetic and real data sets.
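The GILS loop described in the abstract can be sketched as follows. This is a minimal illustration of the plain heuristic without covariates, assuming a linear demand model `a - b * price` with Gaussian noise, a bounded price interval, and two fixed initial prices; the parameter values and the function names (`ols_line`, `greedy_price`, `run_gils`) are hypothetical and are not the authors' implementation.

```python
import random


def ols_line(prices, demands):
    """Least-squares fit of demand = alpha + beta * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_d = sum(demands) / n
    sxx = sum((p - mean_p) ** 2 for p in prices)
    if sxx == 0.0:
        # No price variation: the slope is not identifiable, so return a flat fit.
        return mean_d, 0.0
    sxy = sum((p - mean_p) * (d - mean_d) for p, d in zip(prices, demands))
    beta = sxy / sxx
    return mean_d - beta * mean_p, beta


def greedy_price(alpha, beta, p_min, p_max):
    """Price in [p_min, p_max] maximizing estimated revenue p * (alpha + beta * p)."""
    candidates = [p_min, p_max]
    if beta < 0.0:
        p_star = -alpha / (2.0 * beta)  # stationary point of the concave quadratic
        if p_min <= p_star <= p_max:
            candidates.append(p_star)
    return max(candidates, key=lambda p: p * (alpha + beta * p))


def run_gils(T, a=10.0, b=2.0, noise_sd=0.5, p_min=0.5, p_max=5.0, seed=0):
    """Simulate GILS for T periods; return the prices charged and the cumulative regret."""
    rng = random.Random(seed)

    def expected_revenue(p):
        return p * (a - b * p)

    p_oracle = min(max(a / (2.0 * b), p_min), p_max)  # oracle price under the true model
    prices, demands, regret = [], [], 0.0
    for t in range(T):
        if t < 2:
            p = (p_min, p_max)[t]  # assumption: two distinct starting prices
        else:
            alpha, beta = ols_line(prices, demands)
            p = greedy_price(alpha, beta, p_min, p_max)
        d = a - b * p + rng.gauss(0.0, noise_sd)  # realized (noisy) demand
        prices.append(p)
        demands.append(d)
        regret += expected_revenue(p_oracle) - expected_revenue(p)
    return prices, regret


prices, regret = run_gils(T=200)
print(f"final price: {prices[-1]:.3f}, cumulative regret: {regret:.3f}")
```

Each period refits the line on all past price and demand pairs and then charges the revenue-maximizing price under that fit, with no deliberate exploration. The covariate-augmented variant studied in the paper would add the observed covariates as extra regressors in the least-squares step; that extension is not shown here.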