SGD Convergence under Stepsize Shrinkage in Low-Precision Training
This work addresses SGD convergence in low-precision training, which matters for reducing the computational cost of deep learning. The contribution is incremental: it extends existing SGD theory to a gradient shrinkage model.
The paper studies SGD convergence in low-precision training, where gradient quantization shrinks gradient magnitudes, and shows that this shrinkage slows convergence and raises the steady-state error, with rates governed by the minimum shrinkage factor.
Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this study, we analyze SGD convergence under a gradient shrinkage model in which each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), which slows convergence when \( q_{\min} < 1 \), where \( q_{\min} \) denotes the smallest shrinkage factor. Under standard smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower rate set by \( q_{\min} \) and with a higher steady-state error level due to quantization effects. Our analysis thus explains theoretically how lower numerical precision slows training, by treating it as gradient shrinkage within the standard SGD convergence framework.
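
The effective-stepsize view can be illustrated with a small numerical sketch. The code below is not the paper's experiment. It is a toy example under stated assumptions: a one-dimensional quadratic objective \( f(x) = \tfrac{1}{2} x^2 \), a constant shrinkage factor `q` standing in for \( q_k \), a constant stepsize `mu` standing in for \( \mu_k \), and optional Gaussian gradient noise. The function name `run_sgd` and all parameter values are hypothetical. The sketch does not model quantization noise, so it shows only the slower rate, not the higher steady-state error.

```python
import random


def run_sgd(q, steps=200, mu=0.1, noise_std=0.0, x0=1.0, seed=0):
    """Run SGD on f(x) = 0.5 * x**2 with every stochastic gradient scaled by q.

    q is a constant shrinkage factor in (0, 1] (an assumption made for
    illustration; the abstract allows a per-step factor q_k).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("shrinkage factor q must lie in (0, 1]")
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        # Stochastic gradient: the true gradient x plus zero-mean noise.
        grad = x + rng.gauss(0.0, noise_std)
        # Shrinking the gradient by q equals using the stepsize mu * q.
        x -= mu * q * grad
    return x


if __name__ == "__main__":
    for q in (1.0, 0.5, 0.25):
        print(f"q = {q:4.2f}  |x_50| = {abs(run_sgd(q, steps=50)):.6f}")
```

In the noise-free case the iterate contracts by a factor \( 1 - \mu q \) per step, so a smaller `q` leaves the iterate farther from the minimizer after the same number of steps. Running with shrinkage `q` and stepsize `mu` gives the same trajectory as running without shrinkage at stepsize `mu * q`.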