Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding
This work addresses the challenge of deploying transformer models for natural language understanding on resource-limited devices. It is an incremental improvement in model compression techniques.
The paper tackles the deployment of large transformer models on resource-constrained devices by proposing a quantization-aware, tensor-compressed training approach that achieves a compression ratio of up to 63x with little accuracy loss and significant inference and training speedup.
Fine-tuned transformer models have shown superior performance in many natural language tasks. However, their large size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, the number of arithmetic operations, and ultimately the runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces the number of model parameters. Quantization-aware training with learnable scale factors is then used to obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve convergence, layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer. The performance is demonstrated on two natural language understanding tasks, showing a compression ratio of up to $63\times$, little accuracy loss, and remarkable inference and training speedup.
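To make the two ingredients of the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a tensor-train-matrix (TTM) factorization for the low-rank tensor cores and symmetric uniform quantization with a single scale factor; the abstract specifies neither. The function names, the 768-by-768 layer size, the mode sizes, the ranks, and the 8-bit width are all hypothetical choices made for illustration. Only the forward pass of quantization is shown; in quantization-aware training the scale would be a trainable parameter and the rounding would typically be bypassed in the backward pass with a straight-through estimator.

```python
import numpy as np


def ttm_param_count(in_dims, out_dims, ranks):
    """Number of parameters in a TTM factorization.

    Core k is assumed to have shape
    (ranks[k], in_dims[k], out_dims[k], ranks[k + 1]).
    """
    assert len(in_dims) == len(out_dims) == len(ranks) - 1
    assert ranks[0] == ranks[-1] == 1
    return sum(
        ranks[k] * in_dims[k] * out_dims[k] * ranks[k + 1]
        for k in range(len(in_dims))
    )


def ttm_to_dense(cores):
    """Contract TTM cores into a dense (prod(in_dims), prod(out_dims)) matrix."""
    result = cores[0]  # shape (1, m_1, n_1, r_1)
    for core in cores[1:]:
        # Sum over the shared rank index r, keeping row and column modes apart.
        result = np.einsum("aMNr,rmns->aMmNns", result, core)
        a, M, m, N, n, s = result.shape
        result = result.reshape(a, M * m, N * n, s)
    return result[0, :, :, 0]


def fake_quantize(x, scale, num_bits=8):
    """Forward pass of symmetric uniform quantization with one scale factor.

    Values are rounded to integers in [-2^(b-1), 2^(b-1) - 1] and mapped
    back to floating point, which simulates low-precision storage.
    """
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale


if __name__ == "__main__":
    # Hypothetical shapes: a 768 x 768 layer with 768 = 8 * 8 * 12 = 12 * 8 * 8.
    in_dims, out_dims, ranks = (8, 8, 12), (12, 8, 8), (1, 16, 16, 1)

    rng = np.random.default_rng(0)
    cores = [
        rng.standard_normal((ranks[k], in_dims[k], out_dims[k], ranks[k + 1]))
        for k in range(len(in_dims))
    ]

    # Quantize each core with its own scale (here set from the max magnitude;
    # in quantization-aware training it would be learned instead).
    quantized = [fake_quantize(c, np.abs(c).max() / 127.0) for c in cores]

    dense_params = int(np.prod(in_dims)) * int(np.prod(out_dims))
    tt_params = ttm_param_count(in_dims, out_dims, ranks)
    print("dense parameters:", dense_params)
    print("tensor-core parameters:", tt_params)
    print("parameter ratio:", dense_params / tt_params)
    print("reconstructed weight shape:", ttm_to_dense(quantized).shape)
```

The ratio printed by this sketch depends entirely on the illustrative mode sizes and ranks chosen above and is unrelated to the $63\times$ figure reported in the paper. In practice the dense matrix is never reconstructed; the input is contracted with the cores one at a time, which is where the reduction in arithmetic operations comes from.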