LGOct 15, 2024

Error Diffusion: Post Training Quantization with Block-Scaled Number Formats for Neural Networks

arXiv:2410.11203v11 citationsh-index: 18Has Code
Originality Incremental advance
AI Analysis

This work addresses the need for efficient quantization methods to reduce hardware costs for deploying neural networks, though it is incremental as it builds on existing block-scaled formats.

The paper tackles the problem of preserving model behavior during post-training quantization by introducing error diffusion, a hyperparameter-free method that diffuses quantization errors across layers without requiring backpropagation or Hessian information, achieving competitive results on vision and large language models.

Quantization reduces the model's hardware costs, such as data movement, storage, and operations like multiply and addition. It also affects the model's behavior by degrading the output quality. Therefore, there is a need for methods that preserve the model's behavior when quantizing model parameters. More exotic numerical encodings, such as block-scaled number formats, have shown advantages for utilizing a fixed bit budget to encode model parameters. This paper presents error diffusion (ED), a hyperparameter-free method for post-training quantization with support for block-scaled data formats. Our approach does not rely on backpropagation or Hessian information. We describe how to improve the quantization process by viewing the neural model as a composite function and diffusing the quantization error in every layer. In addition, we introduce TensorCast, an open-source library based on PyTorch to emulate a variety of number formats, including the block-scaled ones, to aid the research in neural model quantization. We demonstrate the efficacy of our algorithm through rigorous testing on various architectures, including vision and large language models (LLMs), where it consistently delivers competitive results. Our experiments confirm that block-scaled data formats provide a robust choice for post-training quantization and could be used effectively to enhance the practical deployment of advanced neural networks.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes