CL · Mar 30, 2025

Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference

arXiv:2503.23294v1 · 5 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This paper addresses the computational inefficiency of long-context LLM inference, offering an optimization specific to the KV cache that targets lower inference latency and GPU memory usage.

The paper tackles the problem of high inference latency and GPU memory usage in long-context LLMs by introducing Cocktail, a chunk-adaptive mixed-precision quantization method for the KV cache, which outperforms state-of-the-art methods in experiments.

Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in LLMs at token granularity, which makes the search process time-consuming and the computation hardware-inefficient. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search quickly determines the optimal bitwidth configuration of the KV cache chunks based on the similarity scores between the corresponding context chunks and the query, while maintaining model accuracy. Chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency that mixed-precision quantization otherwise causes during inference. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
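The two modules described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the similarity thresholds (0.6 and 0.3), and the bitwidth set (16, 4, 2) are all assumptions chosen for illustration. The sketch maps each chunk's query-similarity score to a bitwidth, then reorders chunks so that equal-bitwidth chunks are contiguous. It also checks, on a toy example with one key/value pair per chunk, that softmax attention gives the same output when keys and values are permuted together, which is the property that makes such reordering safe.

```python
import math
from typing import List, Sequence


def assign_bitwidths(scores: Sequence[float], hi: float = 0.6, lo: float = 0.3) -> List[int]:
    """Map each chunk's similarity score to a bitwidth (thresholds are hypothetical)."""
    bits = []
    for s in scores:
        if s >= hi:
            bits.append(16)  # most relevant chunks stay in high precision
        elif s >= lo:
            bits.append(4)
        else:
            bits.append(2)   # least relevant chunks are quantized hardest
    return bits


def reorder_by_bitwidth(bits: Sequence[int]) -> List[int]:
    """Stable permutation of chunk indices that groups equal bitwidths, highest first."""
    return sorted(range(len(bits)), key=lambda i: -bits[i])


def attention(q: Sequence[float], keys, values) -> List[float]:
    """Single-query softmax attention over a list of key/value vectors."""
    logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)                      # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    dim = len(values[0])
    return [sum(w[i] * values[i][d] for i in range(len(values))) / z for d in range(dim)]


# Toy example: five chunks, one key/value pair per chunk for brevity.
scores = [0.82, 0.10, 0.45, 0.05, 0.71]
bits = assign_bitwidths(scores)          # [16, 2, 4, 2, 16]
order = reorder_by_bitwidth(bits)        # [0, 4, 2, 1, 3]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8], [0.1, 0.1], [0.7, 0.6]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]

out_original = attention(q, keys, values)
out_reordered = attention(q, [keys[i] for i in order], [values[i] for i in order])
```

After reordering, each bitwidth group occupies one contiguous block, so a kernel can process each group with a single uniform-precision pass instead of switching precision token by token. The sketch omits the quantization step itself and the way similarity scores are computed.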

