Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?
This work gives a theoretical account of when signSGD is more efficient than SGD for linear regression. The contribution is incremental, but it is useful for optimizing training in machine learning.
The paper analyzes scaling laws for signSGD in linear regression under a power-law random features model. It identifies effects unique to signSGD, such as noise reshaping, that can make signSGD outperform SGD in noise-dominant regimes, and it shows that the compute-optimal slope can become steeper under certain conditions.
We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that the widely used warmup-stable-decay (WSD) schedule further reduces the noise term and sharpens the compute-optimal slope when feature decay is fast but target decay is slow.
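The setup described above can be illustrated with a small simulation. The following is a minimal sketch, not the paper's code. It assumes a particular PLRF parameterization that the abstract does not spell out: raw feature j has standard deviation j^(-alpha), the target coefficients are j^(-beta), the Gaussian sketch has entries of variance 1/d, the batch size is 1, and the target is noiseless. The names `make_plrf`, `population_risk`, `wsd_lr`, and `train` are hypothetical, and the WSD schedule uses linear warmup and linear decay, which is only one common variant.

```python
import numpy as np

def make_plrf(v, d, alpha, beta, rng):
    """Assumed PLRF setup: feature scales j^-alpha, target coefficients j^-beta."""
    j = np.arange(1, v + 1, dtype=float)
    scales = j ** (-alpha)   # std of the j-th raw feature (assumption)
    b = j ** (-beta)         # target coefficients (assumption)
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(v, d))  # Gaussian sketch
    return scales, b, W

def population_risk(theta, scales, b, W):
    """Exact risk 0.5 * E[(x^T W theta - x^T b)^2] for x_j ~ N(0, scales_j^2)."""
    err = W @ theta - b
    return 0.5 * np.sum((scales * err) ** 2)

def wsd_lr(step, total, peak, warmup, decay):
    """Warmup-stable-decay with linear ramps; warmup and decay are step counts."""
    decay_start = total - decay
    if step < warmup:
        return peak * (step + 1) / warmup
    if step < decay_start:
        return peak
    return peak * (total - step) / decay

def train(update, steps, lr, scales, b, W, rng, schedule=None):
    """One-pass training: every step draws a fresh sample (batch size 1)."""
    theta = np.zeros(W.shape[1])
    for t in range(steps):
        x = scales * rng.standard_normal(scales.shape[0])
        feats = W.T @ x                       # sketched features
        residual = feats @ theta - x @ b      # prediction minus target
        grad = residual * feats               # gradient of 0.5 * residual^2
        eta = schedule(t, steps, lr) if schedule else lr
        # signSGD keeps only the coordinate-wise sign of the gradient.
        direction = np.sign(grad) if update == "sign" else grad
        theta -= eta * direction
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scales, b, W = make_plrf(v=200, d=50, alpha=1.0, beta=1.0, rng=rng)
    sched = lambda t, total, peak: wsd_lr(t, total, peak, warmup=100, decay=400)
    runs = {
        "sgd": train("sgd", 2000, 0.1, scales, b, W, rng),
        "signsgd": train("sign", 2000, 0.01, scales, b, W, rng),
        "signsgd + wsd": train("sign", 2000, 0.01, scales, b, W, rng, sched),
    }
    for name, theta in runs.items():
        print(name, population_risk(theta, scales, b, W))
```

The learning rates here are illustrative choices, not tuned or compute-optimal values. Because the sign step has a fixed per-coordinate magnitude, the constant-rate signSGD run should settle at a floor set by the learning rate, which is the kind of noise term a decaying schedule is meant to shrink.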