NF4 Isn't Information Theoretically Optimal (and that's Good)
This work addresses a theoretical gap in quantization methods for machine learning, though it appears incremental as it builds directly on existing NF4 quantization.
The paper challenges the claim that NF4 quantization is information-theoretically optimal for normally distributed weights, showing that the distribution of the absmax-normalized values to be quantized depends on the block size. It proposes a new code that minimizes the expected L1 reconstruction error, which improves performance at larger block sizes.
This note shares some simple calculations and experiments related to absmax-based blockwise quantization, as used in Dettmers et al., 2023. Their proposed NF4 data type is said to be information-theoretically optimal for representing normally distributed weights. I show that this can't quite be the case, because the distribution of the values to be quantized depends on the block size. I apply this insight to derive an improved code by minimizing the expected L1 reconstruction error, rather than using the quantile-based method. This leads to improved performance at larger quantization block sizes, while both codes perform similarly at smaller block sizes.
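To make the block-size dependence concrete, here is a minimal sketch using only the Python standard library. It assumes i.i.d. standard normal weights and the simplest form of absmax normalization (divide each block by its largest-magnitude entry). The helper names `absmax_normalize` and `normalized_std` are illustrative, not taken from Dettmers et al. or any library. If the normalized values followed a single fixed distribution, their spread would not change with the block size.

```python
import random
import statistics

def absmax_normalize(block):
    """Scale a block so its largest-magnitude entry maps to +1 or -1."""
    scale = max(abs(w) for w in block)
    return [w / scale for w in block]

def normalized_std(block_size, n_blocks=2000, seed=0):
    """Empirical std of absmax-normalized N(0, 1) samples for one block size.

    Assumption: weights are i.i.d. standard normal, a toy stand-in for
    real model weights.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_blocks):
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        values.extend(absmax_normalize(block))
    return statistics.pstdev(values)

if __name__ == "__main__":
    # Larger blocks tend to have a larger absmax, so the normalized
    # values concentrate closer to zero.
    for block_size in (16, 64, 256):
        print(block_size, round(normalized_std(block_size), 3))
```

In this sketch the spread of the normalized values shrinks as the block size grows, so a single fixed set of quantization levels cannot match the distribution at every block size.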