LG | AI | CL | Feb 6

A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU

arXiv:2602.17693v1 | h-index: 5
Originality: Synthesis-oriented
AI Analysis

This paper provides a practical reference for deploying quantized reasoning models on Ascend NPU, but its contribution is incremental: it applies existing methods to a new platform.

The paper evaluates post-training quantization (PTQ) methods for reasoning LLMs on Ascend NPU. It finds that 4-bit weight-only quantization works for larger models, that 4-bit weight-activation schemes cause calibration instability and logic collapse in long-context tasks, and that 8-bit quantization remains stable.

Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models: the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four algorithms (AWQ, GPTQ, SmoothQuant, and FlatQuant) that cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
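To make the bit-width trade-off in the abstract concrete, here is a minimal sketch of plain round-to-nearest symmetric quantization in pure Python. It is not AWQ, GPTQ, SmoothQuant, or FlatQuant, and it is not taken from the paper. The helper names (`quantize_symmetric`, `dequantize`, `mean_abs_error`), the synthetic Gaussian data, and the single injected outlier of 50.0 are all assumptions chosen for illustration. The sketch quantizes one row of weights at 8 and 4 bits, then quantizes an activation row containing an outlier at 4 bits, with the scale computed from the data at call time as a dynamic scheme would do.

```python
import random


def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization of one row of floats.

    Returns the integer codes and the scale used to map them back.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length rows."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)

# Hypothetical data: one output channel of weights, roughly Gaussian.
weights = [random.gauss(0.0, 1.0) for _ in range(256)]

# Weight-only quantization: the scale is computed once, offline.
weight_err = {}
for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    weight_err[bits] = mean_abs_error(weights, dequantize(codes, scale))

# Activation quantization: the scale is recomputed per row at inference
# time (the "dynamic" part). A single outlier stretches the scale so that
# most ordinary values collapse onto very few 4-bit levels.
activations = [random.gauss(0.0, 1.0) for _ in range(255)] + [50.0]
act_codes, act_scale = quantize_symmetric(activations, 4)
act_err_4bit = mean_abs_error(activations, dequantize(act_codes, act_scale))

print("weight error, 8-bit:", weight_err[8])
print("weight error, 4-bit:", weight_err[4])
print("activation error with outlier, 4-bit:", act_err_4bit)
```

In this toy setting the 4-bit weight error is larger than the 8-bit error, and the outlier-bearing activation row fares worse still at 4 bits. That is the kind of sensitivity that smoothing and rotation-based methods aim to reduce, though the sketch says nothing about how those methods behave on Ascend NPU. The per-call maximum and rounding pass in `quantize_symmetric` also hints at where runtime overhead in dynamic quantization can come from.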
