LG · May 8

Finer is Better (with the Right Scaling)

arXiv:2605.08565 · 43.7

AI Analysis

For practitioners deploying quantized LLMs, this work removes a key obstacle to using finer-grained quantization, enabling better quality at ultra-low precision without requiring custom hardware formats.

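The paper addresses the paradox that finer block sizes in LLM quantization can degrade quality under standard abs-max scaling. It attributes the degradation to heavy-tailed tensor distributions meeting the coarse upper quantization bins of FP4, not to finer granularity itself. With the right interventions (preventing scaling-factor underflow to zero, the 4-over-6 methodology), finer blocks improve quality: a brute-force search confirms that MSE strictly improves as blocks shrink, downstream perplexity improves, and standard formats such as OCP E4M3 match custom wider-exponent formats such as UE5M3.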

Microscaling is a critical technique for preserving the quality of Large Language Models (LLMs) quantized to ultra-low precision formats. Intuitively, finer block sizes should yield lower quantization error; however, a paradox recently identified in the literature demonstrates that standard abs-max scaling can actually degrade model quality as block sizes shrink. In this work, we investigate the underlying mechanics of this phenomenon. We demonstrate that this degradation is not an inherent limitation of finer granularity, but is primarily driven by heavy-tailed tensor distributions interacting poorly with the coarse upper quantization bins of the FP4 element format. Specifically, we show that i) preventing the scaling factor from underflowing to zero mitigates localized errors, ii) targeted algorithmic interventions like the 4-over-6 methodology effectively correct the quantization geometry for large elements, and iii) a brute-force search establishes an optimal baseline, confirming that the theoretical Mean Squared Error (MSE) strictly improves with finer block sizes. Ultimately, our findings reveal a valuable interchangeability: applying the correct algorithmic recipe allows standard, hardware-compliant formats (like OCP E4M3) to match the performance of custom, wider-exponent formats (like UE5M3). We validate these results across several large language models, fully resolving the block size paradox and achieving robust downstream perplexity improvements.
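
The block-wise scaling described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: it assumes the FP4 E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6), keeps scale factors in full precision instead of storing them in E4M3 or UE5M3, and uses a simple multiplier grid as a stand-in for the paper's brute-force search, whose exact procedure is not given here. All function names are hypothetical.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Assumed search grid: candidate scales are multiples of the abs-max scale.
MULTIPLIERS = np.linspace(0.5, 1.5, 41)

def round_to_fp4(x):
    """Round each value to the nearest FP4 magnitude, keeping its sign.
    Values beyond the largest grid point saturate to 6."""
    idx = np.abs(np.abs(x)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]

def fake_quantize(blocks, scale):
    """Quantize then dequantize blocks of shape (n_blocks, block_size)."""
    safe = np.where(scale > 0, scale, 1.0)  # avoid dividing by a zero scale
    return round_to_fp4(blocks / safe) * safe

def absmax_scale(blocks):
    """Standard abs-max scaling: map each block's largest magnitude to 6."""
    return np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]

def searched_scale(blocks):
    """Per block, keep the candidate scale with the lowest squared error.
    The abs-max scale is the starting point, so the result is never worse."""
    best_scale = absmax_scale(blocks)
    base = best_scale.copy()
    best_err = ((fake_quantize(blocks, base) - blocks) ** 2).sum(axis=1, keepdims=True)
    for m in MULTIPLIERS:
        cand = base * m
        err = ((fake_quantize(blocks, cand) - blocks) ** 2).sum(axis=1, keepdims=True)
        better = err < best_err
        best_scale = np.where(better, cand, best_scale)
        best_err = np.where(better, err, best_err)
    return best_scale

def mse_by_block_size(x, block_sizes=(32, 16, 8)):
    """Return {block_size: (abs-max MSE, searched MSE)} for a flat tensor x."""
    results = {}
    for bs in block_sizes:
        blocks = x.reshape(-1, bs)
        a = np.mean((fake_quantize(blocks, absmax_scale(blocks)) - blocks) ** 2)
        s = np.mean((fake_quantize(blocks, searched_scale(blocks)) - blocks) ** 2)
        results[bs] = (a, s)
    return results
```

Calling `mse_by_block_size` on a heavy-tailed sample, for example `np.random.default_rng(0).standard_t(df=3, size=4096)`, gives a rough way to compare abs-max and searched scales at each block size. Because scale-factor quantization is omitted, the sketch cannot reproduce the underflow effect the abstract describes.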
