Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects
This work offers theoretical insight into adaptive optimizers such as Adam and addresses a known bottleneck in optimization theory, though the contribution is incremental in that it builds on the existing understanding of signSGD.
The authors address the problem of quantitatively understanding the preconditioning and noise-reshaping effects of signSGD. They analyze the optimizer in a high-dimensional limit, derive limiting SDE and ODE models that describe the risk, and quantify four effects: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping.
In recent years, signSGD has garnered interest both as a practical optimizer and as a simple model for understanding adaptive optimizers like Adam. Though there is a general consensus that signSGD preconditions optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high-dimensional limit and derive a limiting SDE and ODE that describe the risk. Using this framework, we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but goes further by quantifying how these effects depend on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
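To make the object of study concrete, the following is a minimal sketch of streaming signSGD on a noisy linear regression problem, recording the population risk after each step. It is an illustration under stated assumptions, not the setup analyzed in the paper: the Gaussian data with identity covariance, the Gaussian label noise, the step size `eta / d`, and the function name `sign_sgd_risk_curve` are all choices made here for the example.

```python
import numpy as np


def sign_sgd_risk_curve(d=200, steps=3000, eta=0.2, noise_std=0.1, seed=0):
    """Run one-sample signSGD on least squares and return the risk at each step.

    Assumptions (illustrative only): x ~ N(0, I_d), y = <x, theta_star> + noise,
    and a per-coordinate step size eta / d.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d) / np.sqrt(d)  # target with norm about 1
    theta = np.zeros(d)
    lr = eta / d
    risks = np.empty(steps + 1)

    for t in range(steps + 1):
        err = theta - theta_star
        # Population risk 0.5 * E[(<x, theta> - y)^2] for identity covariance.
        risks[t] = 0.5 * (err @ err + noise_std**2)
        if t == steps:
            break
        # Draw a fresh sample and form the stochastic gradient of the squared loss.
        x = rng.standard_normal(d)
        y = x @ theta_star + noise_std * rng.standard_normal()
        grad = (x @ theta - y) * x
        # signSGD keeps only the sign of each gradient coordinate.
        theta -= lr * np.sign(grad)

    return risks


if __name__ == "__main__":
    curve = sign_sgd_risk_curve()
    print(f"initial risk: {curve[0]:.4f}, final risk: {curve[-1]:.4f}")
```

Because every coordinate moves by exactly `lr` per step regardless of the gradient magnitude, the risk in this sketch plateaus at a level set by `eta` rather than decaying to the noise floor, which is one simple way to see why the effective learning rate and the noise behave differently from plain SGD.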