LG | ML | Oct 27, 2020

A Statistical Framework for Low-bitwidth Training of Deep Neural Networks

arXiv:2010.14298v1 | 72 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of accelerating deep neural network training with low-bitwidth hardware for researchers and practitioners, but it is incremental as it builds on existing quantization-aware training methods.

The paper tackles the challenge of understanding how gradient quantization affects convergence in fully quantized training (FQT). It develops a statistical framework showing that the FQT gradient is an unbiased estimator of the quantization-aware training (QAT) gradient, and it introduces two gradient quantizers with lower variance than the existing per-tensor quantizer. For ResNet-50 on ImageNet, the 5-bit block Householder quantizer loses only 0.5% validation accuracy relative to QAT, comparable to the existing INT8 baseline.

Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks. One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties. In this paper, we address this problem by presenting a statistical framework for analyzing FQT algorithms. We view the quantized gradient of FQT as a stochastic estimator of its full precision counterpart, a procedure known as quantization-aware training (QAT). We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance. Inspired by these theoretical results, we develop two novel gradient quantizers, and we show that these have smaller variance than the existing per-tensor quantizer. For training ResNet-50 on ImageNet, our 5-bit block Householder quantizer achieves only 0.5% validation accuracy loss relative to QAT, comparable to the existing INT8 baseline.
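
The unbiasedness and variance claims in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration and is not taken from the paper or its code: the function name `quantize_stochastic`, the min-max uniform grid, the synthetic "gradient" matrix, and the per-row scaling variant. The per-row variant is not the block Householder quantizer; it only shows why a finer-grained scale can lower variance compared with a single per-tensor scale when rows differ widely in magnitude.

```python
import numpy as np

def quantize_stochastic(x, bits, axis=None, rng=None):
    """Stochastically round x onto a uniform grid with 2**bits levels.

    axis=None uses one min-max range for the whole tensor (per-tensor).
    axis=1 uses one min-max range per row (illustrative variant only).
    """
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    # Guard against a zero range so the division below is always defined.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    t = (x - lo) / scale                     # position on the grid, in [0, levels]
    floor = np.floor(t)
    # Round up with probability equal to the fractional part, so E[q] = t.
    q = floor + (rng.random(x.shape) < (t - floor))
    return q * scale + lo                    # dequantized value

rng = np.random.default_rng(0)
# Synthetic stand-in for a gradient: rows with very different magnitudes.
grad = rng.standard_normal((8, 64)) * np.logspace(-3, 0, 8)[:, None]

def estimate(axis, n=2000):
    """Monte Carlo estimate of the worst-case bias and the total variance."""
    samples = np.stack(
        [quantize_stochastic(grad, 5, axis, rng) for _ in range(n)]
    )
    bias = np.abs(samples.mean(axis=0) - grad).max()
    total_var = samples.var(axis=0).sum()
    return bias, total_var

bias_tensor, var_tensor = estimate(axis=None)
bias_row, var_row = estimate(axis=1)
print(f"per-tensor: max |bias| = {bias_tensor:.4f}, total variance = {var_tensor:.4f}")
print(f"per-row:    max |bias| = {bias_row:.4f}, total variance = {var_row:.4f}")
```

In this sketch both variants average back to the unquantized input, which is the sense in which a stochastically rounded gradient is an unbiased estimator. The per-row variant shows lower total variance here because small-magnitude rows are no longer rounded on a grid sized for the largest row.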

Code Implementations: 2 repos
