LG · May 9

LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

arXiv:2605.08755 · 70.3

Predicted impact: top 25% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For practitioners deploying large reasoning models, LAQuant reduces the accuracy loss that weight quantization causes during long autoregressive decoding, enabling faster inference with less degradation than prior quantization recipes.

LAQuant introduces a layer-wise weight quantization method that preserves accuracy on long-decoding reasoning benchmarks, achieving a 15.11pp improvement on AIME25 Pass@1 over ParoQuant for Qwen3-4B under W3G128 quantization, with a 3.42x decoding speedup over FP16.
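W3G128 conventionally denotes 3-bit weights with one set of quantization parameters shared by each group of 128 consecutive weights. The sketch below illustrates only that storage format, assuming plain asymmetric round-to-nearest; it is not LAQuant's training procedure, and the function name `quantize_groupwise` is hypothetical.

```python
import random


def quantize_groupwise(weights, bits=3, group_size=128):
    """Asymmetric round-to-nearest quantization with one (scale, offset) per group.

    Assumption: a generic baseline for illustration, not the method from the paper.
    """
    levels = 2 ** bits - 1  # 3 bits -> integer codes 0..7
    codes, dequantized = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
        for w in group:
            q = min(levels, max(0, round((w - lo) / scale)))
            codes.append(q)                   # what would be stored in 3 bits
            dequantized.append(lo + q * scale)  # what the matmul effectively sees
    return codes, dequantized


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # one toy weight row
codes, row_hat = quantize_groupwise(row)
max_err = max(abs(a - b) for a, b in zip(row, row_hat))
print(f"groups: {len(row) // 128}, max abs error: {max_err:.5f}")
```

Smaller groups tighten each group's range and so reduce rounding error, at the cost of storing more scales and offsets.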

Large reasoning models (LRMs) reach competition-level math and coding accuracy via long autoregressive decoding, making per-token decoding cost a primary deployment concern. Weight quantization is the standard tool for acceleration, but representative recipes -- including state-of-the-art end-to-end (E2E) QAT -- lose accuracy on long-decoding reasoning benchmarks despite preserving perplexity and short-decode accuracy. Through a systematic gradient-direction analysis, we identify two factors driving this gap: (i) KV-cache fidelity preservation under the QAT loss, which E2E supervision attenuates via the softmax Fisher metric; and (ii) Hessian-subspace alignment between calibration data and the deployment distribution. We propose LookAhead Quantization (LAQuant), a layer-wise weight-only QAT method that addresses both factors without online-transform overhead by combining reasoning-domain calibration with a one-layer lookahead loss whose implicit cross-layer co-adaptation preserves the next-layer residual stream. For Qwen3-4B under W3G128 quantization, LAQuant improves AIME25 Pass@1 over ParoQuant by 15.11pp (1.93pp over ParoQuant++ at matched calibration) while achieving a 3.42x decoding speedup over FP16 on RTX A6000, compared with ParoQuant's 3.01x.
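The abstract does not state the lookahead loss explicitly. One plausible reading, written out as a sketch: a standard layer-wise objective matches the output of layer $l$ alone, whereas a one-layer lookahead objective measures the mismatch after the residual stream has also passed through layer $l+1$. The notation here is assumed, not taken from the paper: $f_l$ is the $l$-th layer, $W_l$ its full-precision weights, $\hat{W}_l$ its quantized weights, and $x$ the layer input from calibration data. Whether layer $l+1$ is evaluated with full-precision or quantized weights inside the loss is likewise an assumption.

```latex
\mathcal{L}^{(l)}_{\text{layer}}
  = \bigl\| f_l(x;\hat{W}_l) - f_l(x;W_l) \bigr\|_2^2,
\qquad
\mathcal{L}^{(l)}_{\text{lookahead}}
  = \bigl\| f_{l+1}\bigl(f_l(x;\hat{W}_l);W_{l+1}\bigr)
          - f_{l+1}\bigl(f_l(x;W_l);W_{l+1}\bigr) \bigr\|_2^2 .
```

Under this reading, errors in layer $l$ are weighted by how much the next layer actually responds to them, which is consistent with the abstract's claim that the loss preserves the next-layer residual stream.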
