A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
This work provides a practical reference for deploying quantized reasoning models on Ascend NPU, but its contribution is incremental: it applies existing methods to a new platform.
This paper evaluates post-training quantization (PTQ) methods for reasoning LLMs on Ascend NPU. It finds that 4-bit weight-only quantization is viable for larger models, that 4-bit weight-activation schemes cause instability and logic collapse in long-context tasks, and that 8-bit quantization remains stable.
Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models, namely the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four algorithms (AWQ, GPTQ, SmoothQuant, and FlatQuant) that span the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
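To make the 4-bit versus 8-bit contrast concrete, the following is a minimal sketch of plain round-to-nearest symmetric quantization. It is not AWQ, GPTQ, SmoothQuant, or FlatQuant, and it does not reproduce the paper's NPU results. The function names, the synthetic Gaussian weights, and the single injected activation outlier are all assumptions chosen for illustration. It shows only the generic precision gap between bit widths, and that a scale derived from the tensor itself must be computed before any value can be quantized, which is the extra runtime step that dynamic activation quantization adds.

```python
import random


def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization with one scale per vector.

    Illustrative helper (hypothetical, not from the paper or any library).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    # The scale depends on the data itself. For weights this is computed once
    # offline; for dynamically quantized activations it is recomputed at runtime.
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(values, bits):
    """Average reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)

# Assumed toy data: small Gaussian "weights" and unit-variance "activations"
# with one large outlier, a pattern often cited as making activations
# harder to quantize than weights.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
activations = [random.gauss(0.0, 1.0) for _ in range(1023)] + [60.0]

err_w4 = mean_abs_error(weights, 4)
err_w8 = mean_abs_error(weights, 8)
err_a4 = mean_abs_error(activations, 4)
err_a8 = mean_abs_error(activations, 8)

print(f"weights     4-bit error: {err_w4:.6f}   8-bit error: {err_w8:.6f}")
print(f"activations 4-bit error: {err_a4:.6f}   8-bit error: {err_a8:.6f}")
```

In this toy setting the single outlier stretches the 4-bit activation grid so far that most ordinary values collapse to zero, while the 8-bit grid still resolves them. Real PTQ methods such as those evaluated here add per-channel or per-group scales, calibration data, and smoothing or transformation steps on top of this basic scheme.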