LG AI CL | Sep 26, 2023

Efficient Post-training Quantization with FP8 Formats

Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang

arXiv:2309.14592v221.446 citations | h-index: 18 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the computational cost of deploying large models such as LLMs and diffusion models. It is an incremental advance, since it builds on existing quantization methods and applies new data formats to them.

The study evaluates FP8 formats for post-training quantization across 75 network architectures. It finds that FP8 outperforms INT8 in workload coverage (92.64% versus 65.87%), and that E4M3 is better suited to NLP models while E3M4 performs marginally better on computer vision tasks.

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
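The trade-off between dynamic range and precision across the three formats can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the paper or from Intel Neural Compressor: the class `FP8Format` and its methods are hypothetical names, and the bias is assumed to be 2^(e-1) - 1. With `ieee_like=True` the top exponent code is assumed to be reserved for infinities and NaN, as in IEEE 754. With `ieee_like=False` only the all-ones mantissa at the top exponent is assumed to be reserved, an extended-range convention that some FP8 specifications use for E4M3 (giving a maximum of 448).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FP8Format:
    """Hypothetical helper: 1 sign bit + exp_bits + man_bits = 8 bits."""
    name: str
    exp_bits: int
    man_bits: int

    @property
    def bias(self) -> int:
        # Assumption: conventional exponent bias 2^(e-1) - 1.
        return 2 ** (self.exp_bits - 1) - 1

    def max_normal(self, ieee_like: bool = True) -> float:
        top_code = 2 ** self.exp_bits - 1
        if ieee_like:
            # Top exponent code is reserved for inf/NaN.
            exponent = (top_code - 1) - self.bias
            significand = 2.0 - 2.0 ** -self.man_bits
        else:
            # Extended range: only the all-ones mantissa at the top
            # exponent code is reserved (for NaN).
            exponent = top_code - self.bias
            significand = 2.0 - 2.0 ** (1 - self.man_bits)
        return significand * 2.0 ** exponent

    def min_subnormal(self) -> float:
        # Smallest positive value: one mantissa step at the minimum exponent.
        return 2.0 ** (1 - self.bias - self.man_bits)

    def epsilon(self) -> float:
        # Relative spacing between adjacent values just above 1.0.
        return 2.0 ** -self.man_bits


FORMATS = [
    FP8Format("E5M2", 5, 2),
    FP8Format("E4M3", 4, 3),
    FP8Format("E3M4", 3, 4),
]

if __name__ == "__main__":
    for fmt in FORMATS:
        print(
            f"{fmt.name}: max(ieee)={fmt.max_normal(True)}, "
            f"max(extended)={fmt.max_normal(False)}, "
            f"min_subnormal={fmt.min_subnormal()}, eps={fmt.epsilon()}"
        )
```

Each mantissa bit moved to the exponent widens the representable range but doubles the relative spacing between adjacent values, which is the range-versus-precision trade-off the abstract refers to.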
