LG | AI | Oct 15, 2024

QSpec: Speculative Decoding with Complementary Quantization Schemes

Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu

arXiv:2410.11305v3 | 17.613 citations | h-index: 7 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

QSpec offers a practical solution for efficient, high-fidelity LLM serving under memory constraints, though the advance is incremental because it builds on existing quantization and speculative decoding techniques.

The paper tackles the problem of performance degradation in quantized large language models during multi-step reasoning by proposing QSpec, a speculative decoding method that integrates low-precision joint quantization for drafting and high-precision weight-only quantization for verification, achieving up to 1.64x speedup without quality loss.

Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). While activation-weight joint quantization enables efficient low-precision decoding, it suffers from substantial performance degradation on multi-step reasoning tasks. We propose QSpec, a novel quantization paradigm that decouples efficiency from quality by integrating two complementary schemes via speculative decoding: low-precision joint quantization for fast drafting and high-precision weight-only quantization for accurate verification. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models. Compared to high-precision baselines, QSpec achieves up to 1.64x speedup without quality degradation, and outperforms state-of-the-art speculative decoding methods by up to 1.55x in batched settings. Furthermore, QSpec supports plug-and-play deployment and generalizes well across model scales, quantization methods, and workloads. These properties make QSpec a practical and scalable solution for high-fidelity quantized LLM serving under memory-constrained scenarios. Our code is available at https://github.com/hku-netexplo-lab/QSpec.
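The draft-then-verify loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tiny random "model", the names `draft_next`, `verify_next`, and `speculative_decode`, the draft length `gamma`, and the greedy acceptance rule are all assumptions made for illustration. Coarse rounding of activations stands in for low-precision joint quantization, both modes share one weight set, verification runs token by token rather than in a single parallel pass, and KV cache reuse is not modeled.

```python
import random

VOCAB, DIM = 32, 8
_rng = random.Random(0)
# One weight set shared by both modes (stand-in for the reused quantized weights).
EMB = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
OUT = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def hidden(context):
    """Toy activation: decayed sum of the embeddings of the last 3 tokens."""
    h = [0.0] * DIM
    for age, tok in enumerate(reversed(context[-3:])):
        for i in range(DIM):
            h[i] += EMB[tok][i] * 0.5 ** age
    return h

def next_token(context, act_step=None):
    """Greedy next token; act_step, if given, coarsely rounds the activations."""
    h = hidden(context)
    if act_step is not None:
        h = [round(x / act_step) * act_step for x in h]
    logits = [sum(w * x for w, x in zip(row, h)) for row in OUT]
    return max(range(VOCAB), key=logits.__getitem__)

def draft_next(context):
    # Hypothetical fast mode: same weights, low-precision activations.
    return next_token(context, act_step=0.5)

def verify_next(context):
    # Hypothetical accurate mode: same weights, full-precision activations.
    return next_token(context)

def speculative_decode(prompt, max_new, gamma=4):
    """Draft gamma tokens, keep the prefix the verifier agrees with, repeat."""
    tokens = list(prompt)
    accepted = drafted = 0
    while len(tokens) - len(prompt) < max_new:
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        drafted += gamma
        for i, tok in enumerate(draft):
            target = verify_next(tokens + draft[:i])
            if target != tok:
                # First mismatch: keep the agreed prefix plus the verifier's token.
                tokens.extend(draft[:i])
                tokens.append(target)
                accepted += i
                break
        else:
            # All drafts accepted: the verifier also contributes one extra token.
            tokens.extend(draft)
            accepted += gamma
            tokens.append(verify_next(tokens))
    return tokens[: len(prompt) + max_new], accepted / drafted

if __name__ == "__main__":
    out, rate = speculative_decode([1, 2, 3], max_new=20)
    print(out, f"acceptance rate: {rate:.2f}")
```

Under greedy acceptance, every emitted token is one the verifier itself would have chosen, so the output matches plain verifier-only decoding; the draft mode only affects how many tokens are confirmed per verification round.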
