Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
This work provides theoretical insights into optimizing batch sizes for SGD+M in high-dimensional settings, which is incremental but relevant for practitioners in machine learning and optimization.
The paper analyzes the dynamics of stochastic gradient descent with momentum (SGD+M) on least squares problems in high dimensions, showing that when batch size exceeds a stability threshold (ICR), it achieves linear convergence at an optimal rate of O(1/√κ), matching full-batch performance with smaller batch sizes.
We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrtκ)$, matching optimal full-batch momentum (in particular performing as well as a full-batch but with a fraction of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single batch SGD rate. We give explicit choices for the learning rate and momentum parameter in terms of the Hessian spectra that achieve this performance.