Q8BERT: Quantized 8Bit BERT
This work addresses the compute and memory cost of running large language models in production. It is an incremental improvement in model compression techniques.
The authors tackle the problem of deploying large pre-trained Transformer models such as BERT in production by introducing Q8BERT, a method that quantizes BERT to 8-bit integers during fine-tuning. It achieves 4x compression with minimal accuracy loss and may speed up inference on hardware that supports 8-bit integer operations.
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward large pre-trained Transformer models. Using these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.
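
To make the 4x figure concrete, here is a minimal sketch of 8-bit quantization in Python. The text above does not specify the quantization scheme, so the sketch assumes symmetric linear quantization with a single scale per tensor. The function names (`quantize_int8`, `dequantize`, `fake_quantize`) are hypothetical and chosen for illustration only. Quantization-aware training typically applies a "fake quantize" step like this in the forward pass so the model learns to tolerate rounding error; how gradients are passed through the rounding step is not shown here.

```python
import random
from array import array


def quantize_int8(values):
    """Map floats to int8 with one shared scale (assumed symmetric scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # The scale maps the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = array("b", (max(-127, min(127, round(v * scale))) for v in values))
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to 32-bit floats."""
    return array("f", (q / scale for q in quantized))


def fake_quantize(values):
    """Quantize then dequantize, simulating 8-bit rounding error in float arithmetic."""
    quantized, scale = quantize_int8(values)
    return dequantize(quantized, scale)


# Illustrative data: random 32-bit float "weights" (not taken from any real model).
rng = random.Random(0)
weights = array("f", (rng.gauss(0.0, 1.0) for _ in range(4096)))

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Storage ratio: 4 bytes per float32 versus 1 byte per int8.
compression = (weights.itemsize * len(weights)) / (q_weights.itemsize * len(q_weights))

# The worst-case rounding error is half a quantization step, i.e. 0.5 / scale.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The compression ratio follows directly from the storage widths: each 32-bit float is replaced by one 8-bit integer, plus a single scale value per tensor whose overhead is negligible for large tensors.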