Jun 21, 2023

Training Transformers with 4-bit Integers

arXiv:2306.11987v2
88 citations
h-index: 31
Originality: Incremental advance
AI Analysis

This work addresses slow transformer training by enabling 4-bit integer training on current-generation GPUs. It is an incremental advance over earlier 4-bit training methods, which rely on custom numerical formats that contemporary hardware does not support.

The paper trains transformers with all matrix multiplications carried out in INT4 arithmetic and reports competitive accuracy on natural language understanding, machine translation, and image classification. Its prototype linear operator implementation is up to 2.2 times faster than the FP16 counterpart and speeds up training by up to 35.1%.

Quantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with the INT4 arithmetic. Training with an ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activation and gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress the outliers. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototypical linear operator implementation is up to 2.2 times faster than the FP16 counterparts and speeds up the training by up to 35.1%.
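The abstract names a Hadamard quantizer for suppressing activation outliers but does not give its details. The following is a minimal sketch of the general idea only, not the paper's implementation. It assumes a symmetric per-tensor INT4 quantizer with integer codes in [-8, 7] and an orthonormal Walsh-Hadamard transform applied to a single vector. The helper names (`fwht`, `quantize_int4`, `dequantize`, `squared_error`) and the sample vector are hypothetical and chosen for illustration. With one large outlier, direct quantization rounds every small entry to zero, while quantizing in the Hadamard domain spreads the outlier across all coordinates and keeps more of the signal.

```python
import math


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    out = list(v)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [t / norm for t in out]


def quantize_int4(v):
    """Symmetric per-tensor quantization (an assumption of this sketch).

    Returns integer codes clamped to [-8, 7] and the scale used.
    """
    max_abs = max(abs(t) for t in v)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(t / scale))) for t in v]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [c * scale for c in codes]


def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Illustrative activation vector with a single outlier (8.0).
x = [0.5, -0.4, 0.3, 8.0, -0.5, 0.4, -0.3, 0.2]

# Quantize directly: the outlier sets the scale, small entries collapse to 0.
direct = dequantize(*quantize_int4(x))

# Quantize in the Hadamard domain, then transform back.
via_hadamard = fwht(dequantize(*quantize_int4(fwht(x))))

err_direct = squared_error(x, direct)
err_hadamard = squared_error(x, via_hadamard)
print("direct error:  ", err_direct)
print("hadamard error:", err_hadamard)
```

Because the transform is orthonormal, the error measured after transforming back equals the quantization error in the Hadamard domain, so the comparison is fair. In an actual INT4 matrix multiplication both operands would need to be transformed so that the product is unchanged; this sketch omits that step.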

Code Implementations: 1 repo
