LG · Mar 19, 2024

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

arXiv:2403.12422v2 · 36 citations · Has Code · ICML
AI Analysis

This work addresses efficiency bottlenecks in transformer pretraining, a practical concern for researchers and practitioners training large-scale models. The contribution is incremental, since it builds on existing quantization techniques.

The paper tackles slow transformer pretraining by proposing Jetfire, an INT8 fully quantized training method that achieves accuracy comparable to FP16 baselines while offering, for a standard transformer block, a 1.42x end-to-end training speedup and a 1.49x memory reduction.

Pretraining transformers is generally time-consuming. Fully quantized training (FQT) is a promising approach to speed up pretraining. However, most FQT methods adopt a quantize-compute-dequantize procedure, which often leads to suboptimal speedup and significant performance degradation when used in transformers due to the high memory access overheads and low-precision computations. In this work, we propose Jetfire, an efficient and accurate INT8 training method specific to transformers. Our method features an INT8 data flow to optimize memory access and a per-block quantization method to maintain the accuracy of pretrained transformers. Extensive experiments demonstrate that our INT8 FQT method achieves comparable accuracy to the FP16 training baseline and outperforms the existing INT8 training works for transformers. Moreover, for a standard transformer block, our method offers an end-to-end training speedup of 1.42x and a 1.49x memory reduction compared to the FP16 baseline. Our code is open sourced at https://github.com/thu-ml/Jetfire-INT8Training.
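
To make the idea of per-block quantization concrete, here is a minimal sketch in plain Python. It is not the Jetfire implementation: the function names, the tiny 2x2 block size, and the symmetric absmax scaling rule are assumptions chosen for illustration. Each tile of the matrix gets its own scale, so an outlier in one tile does not wipe out the precision of small values elsewhere.

```python
def quantize_per_block(matrix, block=2, qmax=127):
    """Quantize a 2-D list of floats to INT8-range ints, one scale per tile.

    Hypothetical helper for illustration only. Returns (q, scales) where
    q has the same shape as `matrix` with ints in [-qmax, qmax], and
    scales[bi][bj] is the float scale of tile (bi, bj).
    """
    rows, cols = len(matrix), len(matrix[0])
    q = [[0] * cols for _ in range(rows)]
    scales = []
    for r0 in range(0, rows, block):
        scale_row = []
        for c0 in range(0, cols, block):
            # Coordinates of every element in this block x block tile.
            tile = [(r, c)
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            absmax = max(abs(matrix[r][c]) for r, c in tile)
            # Symmetric scaling: the largest magnitude maps to qmax.
            scale = absmax / qmax if absmax > 0 else 1.0
            for r, c in tile:
                v = round(matrix[r][c] / scale)
                q[r][c] = max(-qmax, min(qmax, v))
            scale_row.append(scale)
        scales.append(scale_row)
    return q, scales


def dequantize_per_block(q, scales, block=2):
    """Map the integers back to floats using each tile's scale."""
    return [[q[r][c] * scales[r // block][c // block]
             for c in range(len(q[0]))]
            for r in range(len(q))]


if __name__ == "__main__":
    # The top-right tile contains an outlier (50.0).
    x = [[0.01, -0.02, 50.0, 3.0],
         [0.03, 0.04, -7.0, 1.0],
         [0.50, 0.25, 0.1, 0.2],
         [-0.50, 0.75, 0.3, -0.4]]
    q_blk, s_blk = quantize_per_block(x, block=2)   # one scale per 2x2 tile
    q_all, s_all = quantize_per_block(x, block=4)   # one scale for everything
    print("per-block :", dequantize_per_block(q_blk, s_blk, block=2)[0][0])
    print("per-tensor:", dequantize_per_block(q_all, s_all, block=4)[0][0])
```

Setting `block` to the full matrix size reduces the sketch to per-tensor quantization, which makes it easy to compare the two: with a single shared scale, the outlier forces the small entry `0.01` to round to zero, while the per-block version keeps it close to its original value.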

Code Implementations: 1 repo