LG · CL · Sep 19, 2022

SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision

arXiv:2209.09130v2132 citations · h-index: 22
Originality: Incremental advance
AI Analysis

SAMP gives text-processing practitioners a more user-friendly and efficient quantization solution, though the advance is incremental because it builds on existing quantization methods.

The authors address the complexity of INT8 quantization for model inference, and the accuracy loss it can cause, by developing a toolkit built around Self-Adaptive Mixed-Precision (SAMP). SAMP automatically balances accuracy and efficiency, achieving a higher speedup than PyTorch and FasterTransformer while maintaining the required accuracy.

The latest industrial inference engines, such as FasterTransformer and TurboTransformers, have verified that half-precision floating point (FP16) and 8-bit integer (INT8) quantization can greatly improve model inference speed. However, existing INT8 quantization methods are too complicated, and improper usage can greatly damage model performance. In this paper, we develop a toolkit that lets users easily quantize their models for inference. The toolkit introduces Self-Adaptive Mixed-Precision (SAMP), which uses a mixed-precision architecture to automatically control the quantization rate and balance model accuracy against efficiency. Experimental results show that our SAMP toolkit achieves a higher speedup than PyTorch and FasterTransformer while ensuring the required accuracy. In addition, SAMP has a modular design that decouples the tokenizer, embedding, encoder and target layers, which allows users to handle various downstream tasks and to integrate the toolkit seamlessly into PyTorch.
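The accuracy/efficiency trade-off described in the abstract can be pictured with a small sketch. This is not the SAMP algorithm or its API, which the abstract does not specify. It only illustrates one plausible reading of "automatically control quantization rate": try different numbers of INT8 layers, keep the rest in FP16, and pick the fastest setting whose accuracy drop stays within a user-set tolerance. The names `Candidate`, `choose_mixed_precision`, and `toy_evaluate`, and all numbers in the toy model, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    int8_layers: int   # layers run in INT8; the remaining layers stay in FP16
    accuracy: float    # task accuracy measured for this setting
    latency_ms: float  # inference latency measured for this setting


def choose_mixed_precision(
    num_layers: int,
    evaluate: Callable[[int], Tuple[float, float]],
    baseline_accuracy: float,
    max_accuracy_drop: float,
) -> Candidate:
    """Return the fastest setting whose accuracy drop is within the tolerance.

    `evaluate(k)` is a hypothetical callback that quantizes k layers to INT8
    and returns (accuracy, latency_ms) on a validation set.
    """
    best: Optional[Candidate] = None
    for k in range(num_layers + 1):
        accuracy, latency_ms = evaluate(k)
        if baseline_accuracy - accuracy > max_accuracy_drop:
            continue  # too much accuracy lost at this quantization rate
        candidate = Candidate(k, accuracy, latency_ms)
        if best is None or candidate.latency_ms < best.latency_ms:
            best = candidate
    if best is None:
        raise ValueError("no setting satisfies the accuracy tolerance")
    return best


def toy_evaluate(k: int) -> Tuple[float, float]:
    # Made-up model: each INT8 layer costs 0.004 accuracy and saves 0.5 ms.
    return 0.90 - 0.004 * k, 10.0 - 0.5 * k


chosen = choose_mixed_precision(
    num_layers=12,
    evaluate=toy_evaluate,
    baseline_accuracy=0.90,
    max_accuracy_drop=0.01,
)
print(chosen)
```

In a real setting, `evaluate` would run the quantized model on held-out data. The exhaustive loop is used here only for clarity.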
