LGOCMLMar 13, 2019

Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

arXiv:1903.05662v4419 citations
Originality Highly original
AI Analysis

This work addresses the theoretical gap in using STE for quantized networks, which is crucial for efficient deep learning deployment on resource-constrained devices.

The paper tackles the problem of training activation quantized neural networks, where standard gradients vanish due to piecewise constant functions, by providing theoretical justification for the straight-through estimator (STE). It proves that with a proper STE choice, the coarse gradient correlates positively with the population gradient and ensures convergence to a critical point, while poor choices cause instability, as verified on CIFAR-10.

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of loss function, the following question arises: why searching in its negative direction minimizes the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modifed chain rule as coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes