LG AI CL Sep 27, 2021

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

arXiv:2109.12948v153.7697 citations Has Code

Originality Incremental advance

AI Analysis

This work addresses the high memory footprint and latency of transformers deployed on resource-limited devices, offering incremental improvements in quantization techniques.

The paper tackles the challenge of quantizing transformer models for efficient deployment. It identifies structured outliers in activations that hinder low-bit representation and introduces per-embedding-group quantization, among other methods, to achieve state-of-the-art post-training quantization results on the GLUE benchmark with BERT. It also shows that weights and embeddings can be quantized to ultra-low bit-widths, giving significant memory savings with minimal accuracy loss.

Transformer-based architectures have become the de facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges, namely high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme, per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with minimal accuracy loss. Our source code is available at https://github.com/qualcomm-ai-research/transformer-quantization.
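
To make the idea of per-embedding-group quantization concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a group is a contiguous chunk of the embedding (last) dimension, that each group gets its own min/max-based scale and zero point, and that quantization is simulated by a quantize-dequantize round trip. The function names, tensor shape, outlier positions, and group count are all hypothetical choices for illustration, and any reordering of embedding dimensions before grouping is left out.

```python
import numpy as np


def fake_quantize(x, num_bits=8):
    """Asymmetric uniform quantize-dequantize using the min/max range of x."""
    qmax = 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / qmax
    if scale == 0.0:  # constant input: nothing to quantize
        return x.copy()
    zero_point = np.round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale


def fake_quantize_per_group(x, num_groups, num_bits=8):
    """Quantize contiguous groups of the last (embedding) axis separately.

    Each group has its own scale and zero point, so an outlier dimension
    only widens the quantization range of the group that contains it.
    """
    groups = np.array_split(x, num_groups, axis=-1)
    return np.concatenate(
        [fake_quantize(g, num_bits) for g in groups], axis=-1
    )


# Synthetic activations: 16 tokens x 64 embedding dimensions (hypothetical).
rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 64))
# Two embedding dimensions with a much larger dynamic range (assumed outliers).
acts[:, 3] *= 50.0
acts[:, 40] *= 50.0

# Mean squared quantization error with one range for the whole tensor
# versus one range per group of 8 embedding dimensions.
err_tensor = np.mean((acts - fake_quantize(acts)) ** 2)
err_group = np.mean((acts - fake_quantize_per_group(acts, num_groups=8)) ** 2)

print(f"per-tensor MSE:          {err_tensor:.6f}")
print(f"per-embedding-group MSE: {err_group:.6f}")
```

With a single range for the whole tensor, the two outlier dimensions set the step size for every element. With per-group ranges, only the two groups containing an outlier pay that cost, so the overall error drops at the price of storing one scale and zero point per group.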
